Martin Pierre,
Kindly see my comments in red.

With regards,
-sriranga(77yrsold)

On Tue, Apr 13, 2010 at 1:39 PM, MARTIN Pierre <[email protected]> wrote:

> Hello again Sriranga,
>
> A separate email for another subject, so you can get better results on
> your recognition. This should also help newcomers who are wondering how to
> "optimize" a picture before feeding it to Tess.
> I've learned empirically that recognition results depend heavily on
> the "quality" of the image you feed the program.
> I've written image processing filters myself, based on various existing
> algorithms (cumulative histogram -> equalization, so that a non-perfect white
> background and non-perfect black characters are automatically equalized to
> pure white and black) and such things.
>
    *> It would be helpful to explain with screenshots, if there are no
objections.*

> I've also learned that the parameters I have to give to those routines
> need to be determined per document type (if a book's page has a certain
> background, it's most likely the same for the whole book, so all pages
> share the same "type").
> > *Yes, you are correct.*
>



> Since I'm working for a corporation which doesn't allow me to share my
> work if it's not in the Tesseract source code, I'm not really allowed to
> give you source code, but here are directions, in case someone else can
> help you code:
> *> I understand your problem of not being able to share your work. As such,
> no need to worry.*
>



> - Build a histogram (a regular one). It simply counts the pixels at each
> gray level. Since there are 256 possible values (0 to 255), the histogram
> is just an array of 256 elements, all set to zero at the beginning of the
> function. At the end of the function, the sum of all 256 elements equals
> the pixel count of your picture (since every pixel has been distributed
> into one of the histogram elements).
> - Based on that, make a cumulative histogram. Take the first array and,
> instead of storing the count for a single gray level in each element, store
> the running sum. So if element 0 has 30 pixels and gray level 1 has 20
> pixels, the cumulative histogram holds 50 in element 1. If gray level 2
> has 5 pixels, element 2 holds 55, and so on until you reach element 255
> (white), which holds the total pixel count.
> - Equalize your image with the cumulative histogram like this: let k be the
> value of the first non-zero element of your cumulative histogram (so if
> elements 0, 1 and 2 are 0 and element 3 is 10, then k = 10) and pc your
> pixel count (in most cases image width * image height). Then loop over
> each pixel, compute cdf1 = cumulativeHistogram[currentPixelLevel] - k and
> cdf2 = pc - k; the new value of the current pixel is cdf1 / cdf2 * 255.
> You'll get a nicely contrasted picture with evenly redistributed gray
> levels.
> - Now transform it to clean black and white (do not confuse b&w with
> grayscale, they are different, and Tesseract prefers b&w). To do this you
> just have to determine the gray level threshold: a pixel below it is
> considered black, a pixel above it white. This threshold is type-specific,
> which is what I was talking about earlier (in a book, you will most likely
> use the same threshold for all the pages, since the gray level
> distribution will be the same on the same background).
> - Sometimes the threshold will have to differ between areas of the
> page. In that case I recommend computing a cumulative histogram per grid
> cell. But be careful not to "split" characters when doing that.
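The first four steps above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not Pierre's filters (which he cannot share): the image is assumed to be a list of rows of 8-bit gray levels, the function names are made up for this example, and the default threshold of 128 is a placeholder that would have to be tuned per document type.

```python
from typing import List

Image = List[List[int]]  # assumed layout: rows of gray levels, 0 = black, 255 = white


def build_histogram(img: Image) -> List[int]:
    """Count how many pixels have each of the 256 gray levels."""
    hist = [0] * 256
    for row in img:
        for level in row:
            hist[level] += 1
    return hist


def cumulative_histogram(hist: List[int]) -> List[int]:
    """Running sum: entry i holds the number of pixels with level <= i."""
    cum, total = [], 0
    for count in hist:
        total += count
        cum.append(total)
    return cum


def equalize(img: Image) -> Image:
    """Spread the gray levels over 0..255 using the cumulative histogram."""
    cum = cumulative_histogram(build_histogram(img))
    pixel_count = cum[-1]
    if pixel_count == 0:  # empty image
        return [row[:] for row in img]
    k = next(c for c in cum if c > 0)  # first non-zero cumulative value
    span = pixel_count - k
    if span == 0:  # only one gray level present, nothing to redistribute
        return [row[:] for row in img]
    return [[round((cum[p] - k) * 255 / span) for p in row] for row in img]


def binarize(img: Image, threshold: int = 128) -> Image:
    """Pixels below the threshold become black (0), all others white (255)."""
    return [[0 if p < threshold else 255 for p in row] for row in img]
```

A typical call would be `binarize(equalize(page), threshold)`, with `threshold` chosen once per book or document type as described above.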
> > *Really interesting points. Since I am neither a programmer nor a
> developer, it is difficult for me to understand the above points. Is it
> possible to explain them with the help of screenshots of a sample image?*
> *Generally, with the help of IrfanView, I make sure that images have 300 or
> 600 dpi and are saved as uncompressed TIFF.*
>
    *I hope Tesseract 3.0 will have a provision/feature to "auto-build
histogram".*
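For pages where a single threshold does not fit every area, the grid idea from Pierre's last point could look roughly like the sketch below. It is a simplification and an assumption on my part: each grid cell is binarized against its own mean gray level rather than against a per-cell cumulative histogram, and the function name and the default cell size of 64 pixels are hypothetical. The cell should stay clearly larger than one character, and strokes that straddle a cell border can still end up thresholded differently on each side, which is the "split" problem Pierre warns about.

```python
from typing import List

Image = List[List[int]]  # assumed layout: rows of gray levels, 0 = black, 255 = white


def tile_threshold(img: Image, tile: int = 64) -> Image:
    """Binarize each tile x tile cell against its own mean gray level."""
    height = len(img)
    width = len(img[0]) if height else 0
    out = [[255] * width for _ in range(height)]
    for top in range(0, height, tile):
        for left in range(0, width, tile):
            rows = range(top, min(top + tile, height))
            cols = range(left, min(left + tile, width))
            values = [img[y][x] for y in rows for x in cols]
            local = sum(values) / len(values)  # per-cell threshold (assumption)
            for y in rows:
                for x in cols:
                    # Darker than the local mean: black. Otherwise: white.
                    out[y][x] = 0 if img[y][x] < local else 255
    return out
```

With a dark left half and a bright right half, a global threshold would turn each half into one solid color, while the per-cell version keeps the darker pixels of both halves as "ink".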

> Voila :)
> Pierre.
>
> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To post to this group, send email to [email protected].
> To unsubscribe from this group, send email to
> [email protected].
> For more options, visit this group at
> http://groups.google.com/group/tesseract-ocr?hl=en.
>
>
