I read here that people have already devised various supplemental strategies (post-OCR processing of the text, etc.).
It occurs to me that obtaining a box file that corresponds to the actual recognition result would open up new possibilities for supplemental processing.

You may have special cases where some characters that (potentially) serve as delimiters are recognized perfectly on the initial pass, while the content they delimit is problematic. In my post-processing, I could take advantage of this with the help of the box coordinates: extract the "delimited" regions from my image file and re-recognize them, perhaps with different settings, different trained data sets, or even special PRE-processing just for those regions. (One threshold value, for instance, works better for bold text but much worse for italics; another value is just the opposite.)

Other potential uses for an end-result box file: you might wish to experiment with extracting illustrations (images) in projects where each illustration happens to have a consistent caption beneath it, like "Fig. 1", "Fig. 2", etc. Box information for the letter "F" in that word would help determine the bottom boundary of the illustration region, and box information for the line above the caption line could help determine its upper boundary.

In fact, for some material, you might be able to determine with certainty that a line is a caption to an illustration, or just that an illustration exists, by computing the difference between that line's y-value and the previous line's y-value (without even needing a consistent pattern in the caption text).

Could someone please help expose the box info in the end result (with some user option in the interface)?

Thanks!
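To make the two ideas above a bit more concrete, here is a rough Python sketch of the post-processing I have in mind. It assumes the result box file would use the same layout as the training box files (one character per line: `char left bottom right top page`, with the origin at the bottom-left of the image); that is my assumption, since no such output exists yet. The function names (`parse_box_lines`, `regions_between`, `find_large_gaps`) are made up for illustration, and grouping character boxes into text lines is left out: `find_large_gaps` simply takes per-line `(bottom, top)` extents as input.

```python
from typing import List, NamedTuple, Tuple


class Box(NamedTuple):
    char: str
    left: int
    bottom: int
    right: int
    top: int


def parse_box_lines(text: str) -> List[Box]:
    """Parse box-file text (assumed layout: char left bottom right top [page])."""
    boxes = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 5:
            continue  # skip blank or malformed lines
        left, bottom, right, top = map(int, parts[1:5])
        boxes.append(Box(parts[0], left, bottom, right, top))
    return boxes


def regions_between(boxes: List[Box], open_char: str, close_char: str,
                    image_height: int) -> List[Tuple[int, int, int, int]]:
    """Bounding rectangles of the characters enclosed by each delimiter pair.

    Rectangles are returned as (left, upper, right, lower) with the origin at
    the top-left, which is what most image libraries expect for cropping.
    """
    regions = []
    start = None
    for i, box in enumerate(boxes):
        if start is None and box.char == open_char:
            start = i
        elif start is not None and box.char == close_char:
            inner = boxes[start + 1:i]
            if inner:
                left = min(b.left for b in inner)
                right = max(b.right for b in inner)
                bottom = min(b.bottom for b in inner)
                top = max(b.top for b in inner)
                # Flip the y axis: box files count up from the bottom edge.
                regions.append((left, image_height - top,
                                right, image_height - bottom))
            start = None
    return regions


def find_large_gaps(line_extents: List[Tuple[int, int]],
                    min_gap: int) -> List[Tuple[int, int]]:
    """Find unusually large vertical gaps between consecutive text lines.

    line_extents holds (bottom, top) per text line, ordered from the top of
    the page downward, in box-file coordinates. Returns (index, gap) pairs,
    where index is the line just below the gap (a candidate caption line).
    """
    gaps = []
    for i in range(len(line_extents) - 1):
        upper_bottom = line_extents[i][0]
        lower_top = line_extents[i + 1][1]
        gap = upper_bottom - lower_top
        if gap >= min_gap:
            gaps.append((i + 1, gap))
    return gaps
```

Each rectangle from `regions_between` could then be cropped out of the page image and fed back to the recognizer with different settings, and each hit from `find_large_gaps` marks a stretch of the page worth checking for an illustration. The `min_gap` threshold would have to be tuned per project.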

