You cannot do this with stock Tesseract alone. You need to implement a
purpose-built image processing pipeline that extracts the text regions
first and then passes them to Tesseract for recognition.
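
As a rough illustration of what the first stage of such a pipeline could
look like, here is a minimal sketch in plain Python. It only generates the
grayscale and inverted variants mentioned in the original question, so that
each can be tried with the OCR engine. The function names, the nested-list
image representation, and the threshold of 128 are assumptions made for this
example; they are not part of Tesseract, and a real pipeline would use an
imaging library and far more elaborate text extraction.

```python
# Sketch only: preprocessing variants for an RGB image given as rows of
# (r, g, b) tuples. All names and the default threshold are assumptions.

def to_gray(rgb_rows):
    """Convert RGB pixels to 0-255 luminance using the BT.601 weights."""
    return [
        [int(round(0.299 * r + 0.587 * g + 0.114 * b)) for (r, g, b) in row]
        for row in rgb_rows
    ]


def invert(gray_rows):
    """Swap light and dark, e.g. to turn light text on dark into dark on light."""
    return [[255 - v for v in row] for row in gray_rows]


def binarize(gray_rows, threshold=128):
    """Map every pixel to 0 or 255; the threshold is an untuned assumption."""
    return [[255 if v >= threshold else 0 for v in row] for row in gray_rows]


def candidate_variants(rgb_rows):
    """Return several preprocessed versions of one image, keyed by name."""
    gray = to_gray(rgb_rows)
    inverted = invert(gray)
    return {
        "gray": gray,
        "gray_inverted": inverted,
        "binary": binarize(gray),
        "binary_inverted": binarize(inverted),
    }
```

Each variant would then be passed to the OCR engine separately and the
results compared, since it is not known in advance whether the text is
lighter or darker than the background.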


Warm regards,
Dmitri Silaev
www.CustomOCR.com


On Tue, Feb 19, 2013 at 12:05 PM, Romeo Jihara <[email protected]> wrote:

> Sorry for bumping the post like this, but I really need some help ASAP!
> Any guesses? At least something about the parameters?
>
> Thanks a lot!
>
> - Romeo
>
> On Friday, February 15, 2013 at 10:07:40 UTC-8, Romeo Jihara
> wrote:
>
>> Hi all,
>>
>> I am trying to detect text that is overlaid on top of images. A common
>> example is memes like the ones here:
>> http://www.quickmeme.com/memes/
>> The goal is to produce a high quality bounding box prediction and, if
>> possible, generate OCR. Please note that I'm much more interested in the
>> former!
>> I am trying to use Tesseract for that.
>>
>> What makes the problem challenging is that the background can be
>> anything. In addition the text can have a stroke and a fill of arbitrary
>> color.
>> My questions are:
>> 1) Tesseract has tons of different parameters. What is a set of important
>> parameters to tune for this case and what are good values for them?
>> 2) How do I preprocess the image? I was a bit surprised to find out that
>> converting the image to grayscale before passing it to Tesseract results in
>> different (and generally better) accuracy. Why? Also, inverting the image
>> works better for some text. What is the set of important transformations
>> to play with?
>> 3) I noticed that Tesseract is often able to detect sequences of words
>> but does not combine them. Which parameter affects the probability of
>> combining adjacent words?
>> 4) Is it worth doing morphological transformations, such as trying to get
>> rid of the text stroke, or does Tesseract handle text strokes?
>> 5) When I call getRegions does it also perform OCR to give me better
>> confidence predictions of the text boxes?
>> 6) Does Tesseract use the OCR output in determining the confidence of a
>> region being true text? Looking at the results I get, it seems
>> possible to improve the text confidence by building an n-gram model. Also,
>> some characters (like punctuation marks) are highly indicative of false
>> positive text regions. Is there such built-in functionality, or should I
>> build it myself?
>> Similarly the size and relative locations of text can also be used to
>> refine the confidence. It appears from my tests that often small and
>> disjoint text areas (and ones that are not horizontally aligned with
>> others) are false positives. Again, is there such a built-in heuristic, or
>> should I build one?
>>
>> I am attaching a couple of examples that show the text localization
>> results with different preprocessing applied to the image. The number
>> inside each box is the confidence for that region; blue boxes mean
>> confidence > 75 and red boxes mean confidence <= 75. I'm also sending the
>> parameters used in all these detections.
>>
>> Thanks for your time and for building such an awesome free OCR engine!
>>
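
On the false-positive heuristics described above (low confidence, small
boxes, boxes not horizontally aligned with any other): a post-filter of
that kind could be sketched as follows. This is a hypothetical helper, not
a Tesseract feature. The Box type and the height and alignment tolerances
are assumptions for illustration; only the confidence cut-off of 75 comes
from the message above.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Hypothetical text-region candidate: top-left corner, size, confidence."""
    x: int
    y: int
    w: int
    h: int
    conf: float


def filter_boxes(boxes, min_conf=75.0, min_height=10, align_tol=0.5):
    """Keep boxes that are confident, not tiny, and share a row with another box.

    min_height and align_tol are untuned, assumed values.
    """
    # Drop low-confidence and very small candidates first.
    strong = [b for b in boxes if b.conf > min_conf and b.h >= min_height]

    kept = []
    for b in strong:
        center_y = b.y + b.h / 2
        # A box counts as aligned if some other surviving box has a vertical
        # center within align_tol times the taller of the two heights.
        aligned = any(
            o is not b
            and abs((o.y + o.h / 2) - center_y) <= align_tol * max(b.h, o.h)
            for o in strong
        )
        if aligned:
            kept.append(b)
    return kept
```

Note that this deliberately discards isolated boxes, so a caption consisting
of a single word on its own line would be lost; whether that trade-off is
acceptable depends on the images.
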
>  --
> --
> You received this message because you are subscribed to the Google
> Groups "tesseract-ocr" group.
> To post to this group, send email to [email protected]
> To unsubscribe from this group, send email to
> [email protected]
> For more options, visit this group at
> http://groups.google.com/group/tesseract-ocr?hl=en
>
> ---
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> For more options, visit https://groups.google.com/groups/opt_out.
>
>
>
