I read here that people have already devised various supplemental
strategies (post-OCR processing of the text, etc.).

It occurs to me that obtaining a box file that corresponds to the
actual recognition result would open up new possibilities for
supplemental processing.  You may have special cases where some
characters that (potentially) serve as delimiters are recognized
perfectly on the initial pass, while the content they delimit is
problematic.  In my post-processing, I could take advantage of this:
using the box coordinates, I could extract the "delimited" regions
from my image file and re-recognize them, perhaps with different
settings, different trained data sets, or even special PRE-processing
just for those regions.  (One threshold value, for instance, works
better for bold text but much worse for italics; another value is
just the opposite.)
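
To make the idea concrete, here is a rough Python sketch.  It assumes
the result box file would use the same layout as the training box
files (one character per line: "char left bottom right top page", with
the origin at the bottom-left of the image).  The names Box,
parse_box_lines and regions_between are made up for illustration, and
the sketch assumes each delimited run sits on a single text line.

```python
from typing import List, NamedTuple, Tuple


class Box(NamedTuple):
    char: str
    left: int
    bottom: int
    right: int
    top: int


def parse_box_lines(text: str) -> List[Box]:
    """Parse box-file text, assuming 'char left bottom right top [page]' per line."""
    boxes = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 5:
            continue  # skip blank or malformed lines
        left, bottom, right, top = (int(p) for p in parts[1:5])
        boxes.append(Box(parts[0], left, bottom, right, top))
    return boxes


def regions_between(
    boxes: List[Box], opener: str, closer: str, image_height: int
) -> List[Tuple[int, int, int, int]]:
    """Return (left, upper, right, lower) rectangles, top-left origin,
    enclosing the characters found between each opener/closer pair."""
    regions = []
    start = None
    for i, box in enumerate(boxes):
        if box.char == opener and start is None:
            start = i
        elif box.char == closer and start is not None:
            inner = boxes[start + 1:i]
            if inner:
                left = min(b.left for b in inner)
                right = max(b.right for b in inner)
                bottom = min(b.bottom for b in inner)
                top = max(b.top for b in inner)
                # Flip the y axis: box files count up from the bottom edge.
                regions.append(
                    (left, image_height - top, right, image_height - bottom)
                )
            start = None
    return regions


# Hypothetical box data for the text "[ab]" on an image 100 pixels tall.
SAMPLE = """[ 10 50 14 70 0
a 16 52 24 64 0
b 26 52 34 68 0
] 36 50 40 70 0
"""

crop_rects = regions_between(parse_box_lines(SAMPLE), "[", "]", 100)
```

Each rectangle is in (left, upper, right, lower) order, which is the
order an image library's crop call typically expects (Pillow's
Image.crop, for example), so the cropped region could then be
pre-processed and fed back to the recognizer on its own.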

Other potential uses for an end-result box file: you might, for
instance, wish to experiment with extracting illustrations (images)
in projects where each illustration has a consistent caption beneath
it, like "Fig. 1", "Fig. 2", etc.  Box information for the letter "F"
in the caption would help determine the bottom boundary of the
illustration region, and box information for the text line above the
caption line could help determine the upper boundary.  In fact, for
some material, you might be able to determine with certainty that a
line is a caption to an illustration, or just that an illustration
exists, by computing the difference between that line's y-value and
the previous line's y-value (without even needing a consistent
pattern in the caption text).
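
The y-value comparison could look something like the Python sketch
below.  It assumes the per-character boxes have already been merged
into one box per text line, listed from the top of the page to the
bottom, in bottom-left-origin coordinates.  The function name
find_illustration_gaps and the factor of 2.0 are arbitrary choices for
illustration, not anything Tesseract provides.

```python
from statistics import median
from typing import List, Tuple

# One box per text line: (left, bottom, right, top), bottom-left origin.
LineBox = Tuple[int, int, int, int]


def find_illustration_gaps(
    lines: List[LineBox], factor: float = 2.0
) -> List[Tuple[int, int]]:
    """Return (gap_bottom, gap_top) y-ranges where the vertical space
    between consecutive lines exceeds factor * the median line height."""
    if len(lines) < 2:
        return []
    typical_height = median(top - bottom for _, bottom, _, top in lines)
    gaps = []
    for above, below in zip(lines, lines[1:]):
        gap_top = above[1]      # bottom edge of the upper line
        gap_bottom = below[3]   # top edge of the lower line
        if gap_top - gap_bottom > factor * typical_height:
            gaps.append((gap_bottom, gap_top))
    return gaps


# Hypothetical page: two body lines, a large gap, a caption, one more line.
LINES = [
    (10, 900, 500, 920),
    (10, 870, 500, 890),
    (10, 500, 200, 520),  # caption line below the illustration
    (10, 470, 500, 490),
]

gaps = find_illustration_gaps(LINES)
```

Each reported range runs from the top of the suspected caption line up
to the bottom of the text line above it, which is the candidate
illustration region.  The threshold would need tuning per project,
since paragraph breaks and headings also produce larger gaps.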

Could someone please help expose the box info in the end result
(with some user option in the interface)?

Thanks!

