Hello,

I would like to use tesseract with only a specific set of fonts, and I would like to know which font actually matched. There is only ever one font in the image, but in principle it could be any one of many different fonts. We can typically narrow it down to a subset beforehand.
I can see only three ways to combine fonts:

1. Train all fonts into one big tesseract file. This has the advantage that all fonts are always present, but at the expense of accuracy and speed. Furthermore, we don't know which font actually matched, and it is not possible to add more fonts later without redoing the entire training.

2. Figure out the tesseract file format and find a way to combine the individually trained fonts (if this is possible at all). This would let us create any combination from the individual fonts, so the number of active fonts could be limited to improve accuracy and speed. The only disadvantage is that we still wouldn't know which font matched.

3. Call tesseract separately for each font. This gives full flexibility, and we would know which font matched. However, it runs the entire processing chain again and again, at the cost of speed. Furthermore, it is not clear to us how best to combine the individual results. Should we pick the result with the best confidence? Or is the rating a better indicator? Since all passes are done on the same image, the outline length should be identical.

To improve speed, is there a way to keep the intermediate results (binarization, connected components, baseline finding, feature extraction, etc.) for later passes?

Is there another way I may have missed? Does tesseract already offer some way to apply more than one font?

Thanks in advance for any help you can provide.

Best regards,
Marcus
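
P.S. To make option 3 concrete, here is a minimal sketch of the selection step I have in mind. It assumes each per-font pass has already produced the recognized text plus a list of per-word confidences on a 0..100 scale; how those values are obtained from tesseract is deliberately left out. The names `FontResult`, `mean_confidence` and `pick_best` are hypothetical and only for illustration, and using the mean confidence is just one possible criterion (the rating could be swapped in the same way).

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Optional


@dataclass
class FontResult:
    """Outcome of one recognition pass with a single font (hypothetical type)."""
    font: str                      # name of the font data used for this pass
    text: str                      # text recognized in this pass
    confidences: List[float] = field(default_factory=list)  # per-word, assumed 0..100


def mean_confidence(result: FontResult) -> float:
    """Average per-word confidence; 0.0 if the pass produced no words."""
    if not result.confidences:
        return 0.0
    return sum(result.confidences) / len(result.confidences)


def pick_best(results: Iterable[FontResult]) -> Optional[FontResult]:
    """Return the pass with the highest mean confidence, skipping empty output."""
    best = None
    for result in results:
        if not result.text.strip():
            continue  # a pass that recognized nothing cannot be the match
        if best is None or mean_confidence(result) > mean_confidence(best):
            best = result
    return best


# Made-up numbers, only to show the call pattern.
passes = [
    FontResult("font_a", "He11o", [80.0, 90.0]),
    FontResult("font_b", "Hello", [95.0, 97.0]),
    FontResult("font_c", "", []),
]
best = pick_best(passes)
print(best.font if best else "no font matched")
```

The open question remains whether the mean confidence is comparable across passes with different fonts, or whether the rating would be the more reliable value to compare.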

