Re: Finding the first block of text?

TP Wed, 10 Jul 2013 23:01:35 -0700

On Mon, Jul 8, 2013 at 3:54 PM, Kurt Marek <[email protected]> wrote:

> I'm new to tesseract, so please excuse my naiveté. I'm trying to scan some
> newspaper headlines, but I don't need the text in the body of the articles.
> Obviously, the headline is a much larger type and a different font. Running
> tesseract in default page segmentation mode usually does a good job of
> recognizing the main body text, but a poor job on the headline. I'm
> thinking that if I can separate out the blocks for the headline, the body
> text, and any nearby images, that I could just perform the recognition on
> the headline and it might work better (and faster). I can always position
> the headline at the top left of the image, so it will be first in reading
> order. I've tried to read through the code and figure out how to only focus
> on the headline block, but I'm a little lost. Will GetComponentImages work?
> Am I barking up the wrong tree?



Depending on your particular images, this might be a job for binary
morphology. See this nice page [1] to get an idea of what can be done with
image processing. It uses Matlab, but probably everything it shows can be
done with any image processing package.

For your problem I'd try:

1) Deskew the image

2) Do a binary morphology "close" operation (see "the closing of an image"
example in "Morphological Image Processing" [2] to see what happens to
some text after it is "closed" with a 2×2 structuring element).

   You'll have to experiment with the size and shape of the structuring
   element, but you may be able to get all the smaller text to turn into a
   few big black blobs.

3) Then use connected components (filtering by size of the components) to
get only the areas that used to be smaller text and create a mask from that.

4) Remove all the stuff in the mask from the original image.
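
Steps 3 and 4 above could be sketched in plain C roughly as follows. This
is only an illustration on a tiny fixed-size 0/1 array, not Leptonica code:
the names flood_fill() and remove_large_blobs(), the W and H sizes, the
4-connectivity and the min_area threshold are all assumptions made for the
sketch. It assumes the closing has already merged the small text into large
blobs, so components at or above min_area are treated as body text.

```c
#include <assert.h>
#include <string.h>

#define W 8   /* toy image width  (assumption for the sketch) */
#define H 4   /* toy image height (assumption for the sketch) */

/* Label the 4-connected foreground component containing (x, y) in img
 * and return its area in pixels. Uses an explicit stack, no recursion. */
static int flood_fill(unsigned char img[H][W], int labels[H][W],
                      int x, int y, int label)
{
    static const int dx[4] = {1, -1, 0, 0};
    static const int dy[4] = {0, 0, 1, -1};
    int stack[W * H][2];   /* each pixel is pushed at most once */
    int top = 0, area = 0;

    labels[y][x] = label;
    stack[top][0] = x; stack[top][1] = y; top++;
    while (top > 0) {
        int cx, cy, k;
        top--;
        cx = stack[top][0]; cy = stack[top][1];
        area++;
        for (k = 0; k < 4; k++) {
            int nx = cx + dx[k], ny = cy + dy[k];
            if (nx < 0 || nx >= W || ny < 0 || ny >= H) continue;
            if (!img[ny][nx] || labels[ny][nx]) continue;
            labels[ny][nx] = label;
            stack[top][0] = nx; stack[top][1] = ny; top++;
        }
    }
    return area;
}

/* Step 3: find components of the closed image with area >= min_area.
 * Step 4: clear the pixels they cover from the original image in place.
 * Returns the number of components that were masked out. */
static int remove_large_blobs(unsigned char closed[H][W],
                              unsigned char original[H][W], int min_area)
{
    int labels[H][W];
    int x, y, xx, yy, next_label = 1, removed = 0;

    assert(min_area > 0);
    memset(labels, 0, sizeof labels);
    for (y = 0; y < H; y++) {
        for (x = 0; x < W; x++) {
            int label, area;
            if (!closed[y][x] || labels[y][x]) continue;
            label = next_label++;
            area = flood_fill(closed, labels, x, y, label);
            if (area < min_area) continue;   /* small component: keep it */
            removed++;
            for (yy = 0; yy < H; yy++)
                for (xx = 0; xx < W; xx++)
                    if (labels[yy][xx] == label) original[yy][xx] = 0;
        }
    }
    return removed;
}
```

Call remove_large_blobs(closed, original, min_area) with the closed image
and the untouched original; whatever survives in original is what you would
hand to the OCR step. The right min_area depends entirely on your scans.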

If you aren't a C/C++ programmer you might be able to do this using
ImageMagick [3].

Otherwise, the Leptonica Image Processing Library [4] is already a
dependency of Tesseract, so you should have it installed. See its
pixDeskew() [5], pixMorphSequence() [6], and pixConnComp() [7] functions.
pixMorphSequence() is particularly easy to use, for example:

   pixOut = pixMorphSequence(pixIn, "c2.2", 0);

will do a 2×2 closing. pixIn must be a 1 bpp (binary) image, and the
final 0 turns off the intermediate debug display.
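
To make it concrete what a 2×2 closing does to the pixels, here is a small
plain-C sketch of the operation itself: a dilation followed by an erosion
with the same 2×2 structuring element. It is not Leptonica's
implementation. The function name close_2x2(), the toy W and H sizes, and
the choice of the top-left pixel as the structuring element's origin are
assumptions for the illustration, and Leptonica's own origin and border
handling may differ.

```c
#define W 7   /* toy image width  (assumption for the sketch) */
#define H 4   /* toy image height (assumption for the sketch) */

/* Binary closing of a 0/1 image with a 2x2 structuring element whose
 * origin is its top-left pixel: dilate first, then erode the result. */
static void close_2x2(unsigned char in[H][W], unsigned char out[H][W])
{
    /* The dilation can grow one pixel right and one pixel down, so it is
     * stored in a buffer one row and one column larger than the image. */
    unsigned char dil[H + 1][W + 1];
    int x, y, i, j;

    /* Dilation: a pixel is set if any input pixel under the reflected
     * element is set. Pixels outside the image count as background. */
    for (y = 0; y <= H; y++) {
        for (x = 0; x <= W; x++) {
            unsigned char v = 0;
            for (j = 0; j < 2; j++) {
                for (i = 0; i < 2; i++) {
                    int sx = x - i, sy = y - j;
                    if (sx >= 0 && sx < W && sy >= 0 && sy < H &&
                        in[sy][sx] != 0)
                        v = 1;
                }
            }
            dil[y][x] = v;
        }
    }

    /* Erosion: a pixel stays set only if the whole 2x2 element fits
     * inside the dilated foreground. */
    for (y = 0; y < H; y++)
        for (x = 0; x < W; x++)
            out[y][x] = dil[y][x] & dil[y][x + 1] &
                        dil[y + 1][x] & dil[y + 1][x + 1];
}
```

Given two 2×2 blocks separated by a one-pixel-wide gap, the gap is filled
and the blocks merge into a single blob, while the outer outline stays the
same. That merging is the effect you want on the small body text, and a
larger structuring element bridges larger gaps.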

See "Removing dark lines from a light pencil drawing" [8] for a Leptonica
example in the same spirit as the first Matlab link.


[1] http://blogs.mathworks.com/steve/2010/10/08/the-two-amigos/

[2]
http://www.cs.auckland.ac.nz/courses/compsci773s1c/lectures/ImageProcessing-html/topic4.htm

[3] http://www.imagemagick.org/Usage/morphology/#basic

[4] http://tpgit.github.io/UnOfficialLeptDocs/leptonica/index.html

[5] http://tpgit.github.io/Leptonica/skew_8c_source.html#l00128

[6] http://tpgit.github.io/Leptonica/morphseq_8c_source.html#l00046

[7] http://tpgit.github.io/Leptonica/conncomp_8c_source.html#l00114

[8] http://tpgit.github.io/UnOfficialLeptDocs/leptonica/line-removal.html

-- 
You received this message because you are subscribed to the Google
Groups "tesseract-ocr" group.
To post to this group, send email to [email protected]
To unsubscribe from this group, send email to
[email protected]
For more options, visit this group at
http://groups.google.com/group/tesseract-ocr?hl=en

