Agreed, Tesseract was not debugged or developed in the manner you speak of. TIFF (handled through the Leptonica library) is the image format chosen for many reasons, one of which is its multi-page support. The better you understand Leptonica, the better your usage of Tesseract will be.
When Tesseract trains, it calls a function and passes it a PIX structure. At that point the function can see only that single structure (the multi-page TIFF), from which it optimizes and builds its complex knowledge of the font.

What you have done creates a paradox. If any of the fonts you have provided appears to match a word, Tesseract assumes the word is limited to that font. In your case some of those fonts have many characters to choose from and some have very few. When the match is with a font that has very few characters to choose from, the word will be completely scrambled.

On Fri, Oct 5, 2012 at 5:53 PM, Quan Nguyen <[email protected]> wrote:

> Instead of concatenating the .tr files, you can merge all your images, if
> they all have the same font style, into a multi-page TIFF and train with
> that. You can use jTessBoxEditor
> <http://vietocr.sourceforge.net/training.html> to merge images and edit
> the box file.
>
> On Monday, October 1, 2012 2:49:45 AM UTC-5, Speedy wrote:
>
>> Hello,
>>
>> I am trying to figure out exactly what effect the font_properties file
>> has.
>>
>> I have already performed a number of trainings with great success.
>> However, there are a few letter confusions that dominate the error rate
>> and which I would like to reduce.
>>
>> Here is the setup: There really is only one font. My training samples
>> are spread over ten separate single-page TIFF files, from which I have
>> created ten separate .tr files. Some of these contain very many
>> characters, some only a few (typically the rare ones, to make sure they
>> are present). At first I created a font_properties file with ten lines,
>> each containing the name of one .tr file (the flags were all set to
>> zero). This gave pretty good results. Then I remembered that all samples
>> of the same font are recommended to be in one .tr file. My guess was
>> that I had needlessly let tesseract try to distinguish ten identical
>> fonts. So, following the recommendation, I concatenated all .tr files
>> into one and reduced the font_properties file to just one line. This,
>> however, actually gave worse results! So how do the font_properties
>> affect training?
>>
>> Thanks in advance for any help!
>>
>> Marcus
>>
> --
> You received this message because you are subscribed to the Google
> Groups "tesseract-ocr" group.
> To post to this group, send email to [email protected]
> To unsubscribe from this group, send email to
> [email protected]
> For more options, visit this group at
> http://groups.google.com/group/tesseract-ocr?hl=en
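
To make the font_properties question from the quoted message concrete, here is a minimal sketch of how ten training files for one font collapse to a single font_properties entry. It assumes the Tesseract 3.x naming convention lang.fontname.expN.ext and a font_properties line made of the font name followed by five 0/1 flags (italic, bold, fixed, serif, fraktur). The file names and helper functions are hypothetical and are not part of Tesseract.

```python
import os
from collections import defaultdict


def font_name(training_file):
    """Return the font part of a name such as 'eng.myfont.exp3.tr'.

    Assumes the lang.fontname.expN.ext naming convention.
    """
    parts = os.path.basename(training_file).split(".")
    if len(parts) < 4 or not parts[-2].startswith("exp"):
        raise ValueError("expected lang.fontname.expN.ext, got %r" % training_file)
    # Everything between the language code and the expN part is the font name.
    return ".".join(parts[1:-2])


def group_by_font(training_files):
    """Map each font name to the list of training files that carry it."""
    groups = defaultdict(list)
    for path in training_files:
        groups[font_name(path)].append(path)
    return dict(groups)


def font_properties_lines(training_files, flags=(0, 0, 0, 0, 0)):
    """Build one font_properties line per distinct font name.

    The same flags are applied to every font, as in the all-zero setup
    described in the quoted message.
    """
    flag_text = " ".join(str(flag) for flag in flags)
    return ["%s %s" % (font, flag_text)
            for font in sorted(group_by_font(training_files))]


if __name__ == "__main__":
    # Hypothetical file names: ten pages of samples, all from one font.
    files = ["eng.myfont.exp%d.tr" % i for i in range(10)]
    for line in font_properties_lines(files):
        print(line)
```

If the ten files instead carried ten different font names, the same helper would emit ten lines, which corresponds to the first setup in the quoted message, where Tesseract is told there are ten fonts.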

