Hi Andreas,

> in my blog-post
> (http://art1pirat.blogspot.de/2013/06/teil-9-selbstversuch-ebook-befreiung-am.html)
> I analyzed the improvements of various options to increase word based
> recognition quality.
>
> In summary the enable_new_segsearch=0 and textord_old_baselines=0
> increased the ratio from 86% to 93% if both combined.

Interesting, thanks for the information. I also found the
enable_new_segsearch helped with grc, but the textord_old_baselines
often made things slightly worse. Did you test with a wide range of
documents? Because it's important not to get carried away optimising
for one thing, only to find that in many other cases the recognition
is worse. I have quite a few different scans I test against now, which
you can see here if you're curious:
http://gitorious.org/ancient-greek-training-for-tesseract/grctestfodder

> I am also interested in useful tips which options (combined with
> others) will have also a positive effect to the recognition quality
> of pages in german fraktur.

By far the biggest improvement for me was with line segmentation.
That's largely because of the high frequency of accents above the main
line, so may not apply with your German Fraktur training. But I would
certainly recommend you check using the hOCR output that the lines and
characters are being segmented perfectly.

I intend to write up how the line segmentation works soon, including
the configuration variables that affect it. In the meantime you can
see my configuration at
http://gitorious.org/ancient-greek-training-for-tesseract/grctraining/blobs/master/grc.config
though it's quite specific to Ancient Greek.

> I was surprised, that language_model_ngram=1 decreases recognition
> quality from 86% to 36%. Could you give me some explanations what
> was wrong?

If you can read C++ you should look at how it's used by tesseract and
let us know ;)

Nick

-- 
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To post to this group, send email to [email protected]
To unsubscribe from this group, send email to [email protected]
For more options, visit this group at http://groups.google.com/group/tesseract-ocr?hl=en
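As a rough illustration of the hOCR check suggested above, the following Python sketch pulls the line bounding boxes out of an hOCR file and flags consecutive lines whose boxes overlap vertically. It assumes lines are marked with class 'ocr_line' and carry a 'bbox x0 y0 x1 y1' entry in the title attribute, as the hOCR format describes. The names line_boxes and overlapping_lines are made up for this example and are not part of tesseract. Overlap is only a hint, since tall accents or descenders can make neighbouring boxes touch legitimately, so flagged lines still need a look by eye.

```python
import re
from html.parser import HTMLParser

# hOCR stores geometry in the title attribute, e.g. "bbox 10 20 300 45; ..."
BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class HocrLineParser(HTMLParser):
    """Collect the bounding box of every element with class 'ocr_line'."""

    def __init__(self):
        super().__init__()
        self.lines = []  # list of (x0, y0, x1, y1) in document order

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if "ocr_line" in classes:
            match = BBOX_RE.search(attrs.get("title") or "")
            if match:
                self.lines.append(tuple(int(g) for g in match.groups()))


def line_boxes(hocr_text):
    """Return the line bounding boxes found in an hOCR document string."""
    parser = HocrLineParser()
    parser.feed(hocr_text)
    return parser.lines


def overlapping_lines(boxes):
    """Return index pairs of consecutive lines whose vertical extents overlap.

    Hypothetical heuristic: the next line starts above the bottom edge
    of the current one.
    """
    pairs = []
    for i in range(len(boxes) - 1):
        bottom = boxes[i][3]
        next_top = boxes[i + 1][1]
        if next_top < bottom:
            pairs.append((i, i + 1))
    return pairs
```

Feed line_boxes the text of the hOCR file tesseract wrote for a page, then pass the result to overlapping_lines and inspect the flagged pairs against the scan.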

