You cannot do this with stock Tesseract alone. You need to implement a purpose-built image processing pipeline that first extracts the text from the image; the extracted text can then be passed to Tesseract for recognition.
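To give a rough idea of what the front end of such a pipeline might look like, here is a minimal sketch in plain Python. It is only an illustration under strong assumptions, not a tested recipe for memes: it uses a single global Otsu threshold, and it assumes the text covers fewer pixels than the background, so that light-on-dark text gets inverted to dark-on-light before OCR. The function names (`to_gray`, `otsu_threshold`, `binarize_dark_text`, `text_bbox`) are hypothetical helpers and are not part of Tesseract. Real images with arbitrary backgrounds and stroked text would most likely need local thresholding, color clustering, or stroke-based filtering instead.

```python
from typing import List, Optional, Tuple

Gray = List[List[int]]  # rows of 0..255 intensities


def to_gray(rgb: List[List[Tuple[int, int, int]]]) -> Gray:
    """Convert RGB pixels to grayscale using the BT.601 luma weights."""
    return [[(299 * r + 587 * g + 114 * b) // 1000 for (r, g, b) in row]
            for row in rgb]


def otsu_threshold(gray: Gray) -> int:
    """Return the threshold t that maximizes between-class variance (Otsu).

    Pixels with value <= t form one class, pixels > t the other.
    """
    hist = [0] * 256
    for row in gray:
        for v in row:
            hist[v] += 1
    total = sum(hist)
    sum_all = sum(i * h for i, h in enumerate(hist))
    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels in the "dark" class so far
    sum0 = 0  # sum of intensities in the "dark" class so far
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue
        w1 = total - w0
        if w1 == 0:
            break
        m0 = sum0 / w0
        m1 = (sum_all - sum0) / w1
        var = w0 * w1 * (m0 - m1) ** 2
        if var > best_var:
            best_var, best_t = var, t
    return best_t


def binarize_dark_text(gray: Gray) -> Gray:
    """Binarize so that the text is black (0) on a white (255) background.

    Assumption: the text is the minority class, so if the dark class is
    the larger one the image is effectively inverted.
    """
    t = otsu_threshold(gray)
    total = sum(len(row) for row in gray)
    dark = sum(1 for row in gray for v in row if v <= t)
    text_is_dark = dark * 2 <= total
    return [[0 if (v <= t) == text_is_dark else 255 for v in row]
            for row in gray]


def text_bbox(binary: Gray) -> Optional[Tuple[int, int, int, int]]:
    """Return (left, top, right, bottom) of all black pixels, or None."""
    ys = [y for y, row in enumerate(binary) if 0 in row]
    xs = [x for row in binary for x, v in enumerate(row) if v == 0]
    if not ys:
        return None
    return (min(xs), min(ys), max(xs), max(ys))
```

The binarized crop given by `text_bbox` is what you would then hand to Tesseract, instead of the original photo.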

Warm regards,
Dmitri Silaev
www.CustomOCR.com

On Tue, Feb 19, 2013 at 12:05 PM, Romeo Jihara <[email protected]> wrote:
> Sorry for "upping" the post like this... But I really need some help ASAP!
> Any guesses? At least something about the parameters?
>
> Thanks a lot!
>
> - Romeo
>
> On Friday, February 15, 2013 at 10:07:40 UTC-8, Romeo Jihara wrote:
>
>> Hi all,
>>
>> I am trying to detect text that is overlaid on top of images. A common
>> example is memes like the ones here:
>> http://www.quickmeme.com/memes/
>> The goal is to produce a high-quality bounding box prediction and, if
>> possible, generate OCR. Please note that I'm much more interested in the
>> former!
>> I am trying to use Tesseract for that.
>>
>> What makes the problem challenging is that the background can be
>> anything. In addition, the text can have a stroke and a fill of arbitrary
>> color.
>> My questions are:
>> 1) Tesseract has tons of different parameters. What is a set of important
>> parameters to tune for this case, and what are good values for them?
>> 2) How do I preprocess the image? I was a bit surprised to find out that
>> converting the image to grayscale before passing it to Tesseract results in
>> different (and generally better) accuracy. Why? Also, inverting the image
>> works better for some text. What is the set of important transformations
>> to play with?
>> 3) I noticed that often Tesseract is able to detect sequences of words
>> but not combine them together. What parameter affects the probability of
>> combining adjacent words together?
>> 4) Is it worth doing morphological transformations, such as trying to get
>> rid of the text stroke, or does Tesseract handle text strokes?
>> 5) When I call getRegions, does it also perform OCR to give me better
>> confidence predictions of the text boxes?
>> 6) Does Tesseract use the OCR output in determining the confidence of a
>> region being true text? Looking at the results I get, it seems like it is
>> possible to improve the confidence by building an n-gram model. Also,
>> some characters (like punctuation marks) are highly indicative of false
>> positive text regions. Is there such built-in functionality, or should I
>> build one?
>> Similarly, the size and relative locations of text can also be used to
>> refine the confidence. It appears from my tests that often small and
>> disjoint text areas (and ones that are not horizontally aligned with
>> others) are false positives. Again, is there such a built-in heuristic, or
>> should I build one?
>>
>> I am attaching a couple of examples that show the text localization
>> results with different preprocessing applied to the image. The number
>> inside each box is the confidence for that region; blue boxes mean
>> confidence > 75 and red boxes <= 75. I'm also sending the parameters used
>> in all these detections.
>>
>> Thanks for your time and for building such an awesome free OCR engine!
>>
> --
> --
> You received this message because you are subscribed to the Google
> Groups "tesseract-ocr" group.
> To post to this group, send email to [email protected]
> To unsubscribe from this group, send email to
> [email protected]
> For more options, visit this group at
> http://groups.google.com/group/tesseract-ocr?hl=en
>
> ---
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> For more options, visit https://groups.google.com/groups/opt_out.

