Hi,

Have you added the fonts to the font_properties file?
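In case it helps, here is a minimal sketch of generating those entries. It assumes the usual Tesseract 3.x line format `<fontname> <italic> <bold> <fixed> <serif> <fraktur>` (each flag 0 or 1) and that spaces in font names become underscores; check the names against your own .tr file names. The flag values below are my guesses from the font names in your list, and `font_properties_line` is just a hypothetical helper, not part of Tesseract.

```python
# Sketch only: build font_properties lines for the fonts named in the thread.
# Assumed line format: <fontname> <italic> <bold> <fixed> <serif> <fraktur>

# Flags are guesses based on the font names; adjust to match the real fonts.
FONTS = {
    "Arial Unicode MS": {},
    "FreeSerif": {"serif": 1},
    "Liberation Mono": {"fixed": 1},
    "Liberation Sans": {},
    "Liberation Sans Narrow Condensed": {},
    "Liberation Serif": {"serif": 1},
    "Segoe UI": {},
}


def font_properties_line(name, italic=0, bold=0, fixed=0, serif=0, fraktur=0):
    """Return one font_properties line; spaces in the name become underscores."""
    key = name.replace(" ", "_")
    return f"{key} {italic} {bold} {fixed} {serif} {fraktur}"


def font_properties_text(fonts=FONTS):
    """Return the whole file content, one line per font, newline-terminated."""
    lines = [font_properties_line(name, **flags) for name, flags in fonts.items()]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(font_properties_text(), end="")
```

Any font used in training but missing from the file is worth ruling out before changing anything else.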
Try removing the 'narrow' font from your training set. Test with just one or two similar fonts and see if the results are better.

ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Sat, Nov 22, 2014 at 7:11 AM, Ryan Dev <software.developer.r...@gmail.com> wrote:

> I am trying to cover as much as I can of the Latin Unicode characters in
> the BMP.
>
> What I find is that as I add more characters, the OCR results get worse.
>
> For example, instead of getting the correct ö I get Ö, and then as I added
> more characters the latest result is Ṏ.
>
> In other words, not only is it getting worse at detecting capitalization
> correctly, but it is favoring more complex characters over the simpler
> solutions! This is just one example; another is Ȧ instead of correctly
> getting A.
>
> When I run a smaller set of training data I get better results (for the
> trained ones; of course others are missed completely).
>
> Should I be trying to do smaller, multiple traineddata files? This will
> reduce performance, but I need accuracy most of all. Plus I've had problems
> where confidence is reported high on incorrect results, and lower on correct
> results.
>
> I'm using the latest Tesseract checkout, on Ubuntu, using the tesstrain.sh
> script.
>
> Linked are the files I'm using, a sample image, and the traineddata. Plus an
> example image I OCR.
>
> https://drive.google.com/folderview?id=0B5ebDnF6cn8UTVhBc25OOV9JYTg&usp=sharing
>
> The Unicode ranges I am trying to train for at the moment are:
>
> 0000 - 007f Basic Latin
> 0080 - 00ff Latin 1 Supplemental
> 0100 - 017f Latin Ext A
> 0180 - 024f Latin Ext B
> 1e00 - 1eff Latin Extended Additional
> 2500 - 2594 Box Draw and Box Elements
> fb00 - fb06 Ligatures
>
> Using the following fonts for training:
>
> arial unicode ms
> freeserif
> liberation mono
> liberation sans
> liberation sans narrow condensed
> liberation serif
> segoe ui
>
> I can certainly add more if that helps, but so far adding fonts just means
> it takes longer to realize how bad the trained data is.
>
> If you are asking why I am doing this, it is because I am trying to create
> a language-agnostic solution. You can see a test image in the link above,
> and can see I am only looking at font glyphs, not full-page OCR.
>
> Any suggestions/advice appreciated!
>
> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to tesseract-ocr+unsubscr...@googlegroups.com.
> To post to this group, send email to tesseract-ocr@googlegroups.com.
> Visit this group at http://groups.google.com/group/tesseract-ocr.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/tesseract-ocr/b5a502dd-78e8-467a-ad0d-a225bc12715b%40googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.
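P.S. For the Unicode ranges quoted above, a rough way to see how many accented variants each base letter has is to group the characters by their canonical decomposition. This is only a sketch using Python's standard `unicodedata` module; the function names are hypothetical, and grouping by NFD base letter is my assumption about which glyphs are likely to be confused (ö/Ö/Ṏ and A/Ȧ fall into such groups). It says nothing about how Tesseract itself classifies them.

```python
# Sketch only: list the characters in the quoted ranges and group them by
# base letter, to see which groups hold many near-identical glyphs.
import unicodedata
from collections import defaultdict

# Inclusive code point ranges, copied from the quoted message.
RANGES = [
    (0x0000, 0x007F, "Basic Latin"),
    (0x0080, 0x00FF, "Latin 1 Supplemental"),
    (0x0100, 0x017F, "Latin Ext A"),
    (0x0180, 0x024F, "Latin Ext B"),
    (0x1E00, 0x1EFF, "Latin Extended Additional"),
    (0x2500, 0x2594, "Box Draw and Box Elements"),
    (0xFB00, 0xFB06, "Ligatures"),
]


def trainable_chars(ranges=RANGES):
    """Yield characters in the ranges, skipping control/unassigned/space ones."""
    for start, end, _name in ranges:
        for cp in range(start, end + 1):
            ch = chr(cp)
            # Categories starting with C (control, unassigned, ...) or
            # Z (separators) have no glyph worth training on.
            if unicodedata.category(ch)[0] in "CZ":
                continue
            yield ch


def base_letter(ch):
    """Return the first code point of the canonical decomposition (NFD)."""
    return unicodedata.normalize("NFD", ch)[0]


def variants_by_base(chars):
    """Map each base letter to the list of characters that decompose to it."""
    groups = defaultdict(list)
    for ch in chars:
        groups[base_letter(ch)].append(ch)
    return groups


if __name__ == "__main__":
    groups = variants_by_base(trainable_chars())
    for base in "Oo Aa".split():
        for letter in base:
            print(letter, len(groups[letter]), "".join(groups[letter]))
```

If the large groups line up with your misrecognitions, that may be a hint about where to split the character set into smaller traineddata files, or which characters to leave out first.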