On Mon, Jul 8, 2013 at 3:54 PM, Kurt Marek <[email protected]> wrote:
> I'm new to tesseract, so please excuse my naiveté. I'm trying to scan some
> newspaper headlines, but I don't need the text in the body of the articles.
> Obviously, the headline is a much larger type and a different font. Running
> tesseract in default page segmentation mode usually does a good job of
> recognizing the main body text, but a poor job on the headline. I'm
> thinking that if I can separate out the blocks for the headline, the body
> text, and any nearby images, that I could just perform the recognition on
> the headline and it might work better (and faster). I can always position
> the headline at the top left of the image, so it will be first in reading
> order. I've tried to read through the code and figure out how to only focus
> on the headline block, but I'm a little lost. Will GetComponentImages work?
> Am I barking up the wrong tree?

Depending on your particular images, this might be a job for Binary Morphology. See this nice page [1] to get an idea of what can be done with Image Processing. It uses Matlab, but probably everything can be done with any image processing package.

For your problem I'd try:

1) Deskew the image.
2) Do a binary morphology "close" operation (see "the closing of an image" example in "Morphological Image Processing" [2], to see what happens to some text after it is "closed" with a 2×2 structuring element). You'll have to experiment with the size/shape of the structuring element, but you may be able to get all the smaller text to turn into a few big black blobs.
3) Then use connected components (filtering by size, keeping the big blobs) to get only the areas that used to be smaller text, and create a mask from that.
4) Remove everything covered by the mask from the original image.

If you aren't a C/C++ programmer you might be able to do this using ImageMagick [3]. Otherwise, the Leptonica Image Processing Library [4] is already included with Tesseract. See its pixDeskew() [5], pixMorphSequence() [6], and pixConnComp() [7] functions. pixMorphSequence() is particularly easy to use, for example:

    pixOut = pixMorphSequence(pixIn, "c2.2", 0);

will do a 2x2 Closing. See "Removing dark lines from a light pencil drawing" [8] for a Leptonica example in the same spirit as the first Matlab link.
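
To make steps 2) to 4) concrete, here is a rough, library-free sketch on a tiny 0/1 pixel array. It is only an illustration, not Leptonica code: the helper names (close_binary, large_component_mask, remove_small_text), the top-left anchored rectangular structuring element, the simplified border handling, and the pixel-count threshold are all my own assumptions. On real scans you would use pixMorphSequence() and pixConnComp() instead and tune the sizes by experiment.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Binary image: row-major bytes, 1 = ink, 0 = paper. */

/* Dilate (grow=1) or erode (grow=0) with a kw x kh rectangle anchored at its
 * top-left corner. Outside the image counts as paper when dilating and as ink
 * when eroding, so the closing below never deletes original ink. */
static void morph(const unsigned char *src, unsigned char *dst,
                  int w, int h, int kw, int kh, int grow)
{
    for (int y = 0; y < h; y++)
        for (int x = 0; x < w; x++) {
            int hit = !grow;
            for (int j = 0; j < kh; j++)
                for (int i = 0; i < kw; i++) {
                    int sx = grow ? x - i : x + i;
                    int sy = grow ? y - j : y + j;
                    int inside = sx >= 0 && sx < w && sy >= 0 && sy < h;
                    int v = inside ? src[sy * w + sx] : !grow;
                    if (grow) hit |= v; else hit &= v;
                }
            dst[y * w + x] = (unsigned char)hit;
        }
}

/* Step 2: closing = dilation followed by erosion. */
void close_binary(const unsigned char *src, unsigned char *dst,
                  int w, int h, int kw, int kh)
{
    unsigned char *tmp = malloc((size_t)w * (size_t)h);
    assert(tmp != NULL);
    morph(src, tmp, w, h, kw, kh, 1);
    morph(tmp, dst, w, h, kw, kh, 0);
    free(tmp);
}

/* Step 3: mark every 4-connected component with at least min_area pixels. */
void large_component_mask(const unsigned char *img, unsigned char *mask,
                          int w, int h, int min_area)
{
    int n = w * h;
    int *stack = malloc((size_t)n * sizeof *stack);
    int *comp = malloc((size_t)n * sizeof *comp);
    unsigned char *seen = calloc((size_t)n, 1);
    assert(stack != NULL && comp != NULL && seen != NULL);
    memset(mask, 0, (size_t)n);
    for (int start = 0; start < n; start++) {
        if (!img[start] || seen[start]) continue;
        int top = 0, count = 0;
        stack[top++] = start;
        seen[start] = 1;
        while (top > 0) {                 /* iterative flood fill */
            int p = stack[--top];
            int x = p % w, y = p / w;
            int nb[4] = { x > 0 ? p - 1 : -1, x < w - 1 ? p + 1 : -1,
                          y > 0 ? p - w : -1, y < h - 1 ? p + w : -1 };
            comp[count++] = p;
            for (int k = 0; k < 4; k++) {
                int q = nb[k];
                if (q >= 0 && img[q] && !seen[q]) {
                    seen[q] = 1;
                    stack[top++] = q;
                }
            }
        }
        if (count >= min_area)
            for (int k = 0; k < count; k++) mask[comp[k]] = 1;
    }
    free(stack);
    free(comp);
    free(seen);
}

/* Steps 2-4: close, mask the big merged blobs, erase them from the original. */
void remove_small_text(const unsigned char *src, unsigned char *dst,
                       int w, int h, int kw, int kh, int min_area)
{
    size_t n = (size_t)w * (size_t)h;
    unsigned char *closed = malloc(n);
    unsigned char *mask = malloc(n);
    assert(closed != NULL && mask != NULL);
    close_binary(src, closed, w, h, kw, kh);
    large_component_mask(closed, mask, w, h, min_area);
    for (size_t i = 0; i < n; i++)
        dst[i] = (unsigned char)(src[i] && !mask[i]);
    free(closed);
    free(mask);
}
```

The idea is that tightly spaced body text fuses into one large component after the closing, while widely spaced headline glyphs stay as separate, smaller components, so a size threshold can tell them apart. Whether that holds for your newspapers depends on the structuring element and threshold you pick.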

[1] http://blogs.mathworks.com/steve/2010/10/08/the-two-amigos/
[2] http://www.cs.auckland.ac.nz/courses/compsci773s1c/lectures/ImageProcessing-html/topic4.htm
[3] http://www.imagemagick.org/Usage/morphology/#basic
[4] http://tpgit.github.io/UnOfficialLeptDocs/leptonica/index.html
[5] http://tpgit.github.io/Leptonica/skew_8c_source.html#l00128
[6] http://tpgit.github.io/Leptonica/morphseq_8c_source.html#l00046
[7] http://tpgit.github.io/Leptonica/conncomp_8c_source.html#l00114
[8] http://tpgit.github.io/UnOfficialLeptDocs/leptonica/line-removal.html

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To post to this group, send email to [email protected]
To unsubscribe from this group, send email to [email protected]
For more options, visit this group at http://groups.google.com/group/tesseract-ocr?hl=en

