Martin Pierre, kindly see my comments in red. With regards, -sriranga (77 yrs old)
On Tue, Apr 13, 2010 at 1:39 PM, MARTIN Pierre <[email protected]> wrote:

> Hello again Sriranga,
>
> A separate email for another subject, so you can get better results on
> your recognition. This should also help newcomers who are wondering how to
> "optimize" a picture before feeding it to Tess.
> I've learned empirically that recognition results depend a lot on the
> "quality" of the image you feed the program.
> I've written image processing filters myself, based on various existing
> algorithms (cumulative histogram -> equalization, so that a non-perfect
> white background and non-perfect black characters are automatically
> equalized to pure white and black) and such things.

*> It would be helpful to explain with screenshots, if there are no objections.*

> I've also learned that the parameters I have to give to those routines
> need to be determined per document type (if a book's page has a certain
> background, it's most likely the same for the whole book, so all pages
> share the same "type").

*> Yes, you are correct.*

> Since I'm working for a corporation which doesn't allow me to share my
> work if it's not in the Tesseract source code, I'm not really allowed to
> give you source code, but here are directions, in case someone else can
> help you code:

*> I understand your problem of not sharing your work. As such, no need to worry.*

> - Build a histogram (a regular one). Typically it "counts" pixels for each
>   gray level. Given that there are 256 possible values (0 to 255), a
>   histogram is just an array of 256 elements, all at zero at the beginning
>   of the function. At the end of the function, if you sum all 256 elements,
>   you'll have the pixel count of your picture (since the pixels will have
>   been distributed across your histogram elements).
> - Based on that, make a cumulative histogram.
>   Basically, you take the first array, and instead of storing the count of
>   pixels of each color in each element, you store the running sum (so if
>   element n°0 has 30 pixels and color level 1 has 20 pixels, then your
>   cumulative histogram has to hold 50 in element n°1. Then if color level 2
>   has 5 pixels, element n°2 will be 55, and so on until you reach the last
>   element, n°255 (white)).
> - Equalize your image with the histogram like this: let's say k is the
>   value of the first non-zero element of your cumulative histogram (so if
>   elements 0, 1 and 2 are 0 and element 3 is 10, then k = 10), and pc is
>   your pixel count (in most cases image width * image height). Then loop
>   over each pixel, compute cdf1 = cumulativeHistogram[currentPixelLevel] - k
>   and cdf2 = pc - k; the new value of the current pixel should be
>   cdf1 / cdf2 * 255. You'll get a very nice "contrasted" picture with
>   equally redistributed colors.
> - Now, transform it to a clean black and white (do not confuse b&w with
>   grayscale, they are different, and Tesseract prefers b&w). To do this you
>   just have to determine the gray level threshold below which you'll
>   consider a pixel to be black, and above which white. This threshold is
>   type-specific, that's what I was talking about earlier (in a book, you
>   will most likely use the same threshold for all the pages, since the
>   color distribution will be the same on the same background).
> - Sometimes, the threshold will have to be different over different
>   areas... I recommend you make a grid of cumulative histograms... But be
>   careful not to "split" characters when doing that.

*Really interesting points. Since I am neither a programmer nor a developer,
it is difficult to understand the above points.
Is it possible to explain them with the help of screenshots of a sample
image?* *Generally, with the help of IrfanView, images are checked to be
300 or 600 dpi and saved as uncompressed TIFF.*
*I hope Tesseract 3.0 will have a provision/feature to "auto-build histogram".*

> Voila :)
> Pierre.
>
> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To post to this group, send email to [email protected].
> To unsubscribe from this group, send email to
> [email protected].
> For more options, visit this group at
> http://groups.google.com/group/tesseract-ocr?hl=en.
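
For readers who would like to see the directions above as code, here is a minimal sketch in plain Python. It is not Pierre's implementation (which was not shared) and not anything from Tesseract itself. The function names, the list-of-rows image representation and the threshold of 128 are all assumptions made for illustration. The equalization step uses the usual form of the formula, (cumulative[level] - k) / (pc - k) * 255, with k taken as the count stored in the first non-zero entry of the cumulative histogram.

```python
# Illustrative sketch only: all names here are hypothetical, and an image is
# assumed to be a list of rows, each row a list of gray levels from 0 to 255.

def build_histogram(pixels):
    """Count how many pixels fall on each of the 256 gray levels."""
    hist = [0] * 256
    for row in pixels:
        for level in row:
            hist[level] += 1
    return hist


def cumulative_histogram(hist):
    """Running sum: element i holds the number of pixels with level <= i."""
    cumulative = []
    total = 0
    for count in hist:
        total += count
        cumulative.append(total)
    return cumulative


def equalize(pixels):
    """Spread the gray levels of the image over the full 0..255 range."""
    cumulative = cumulative_histogram(build_histogram(pixels))
    pixel_count = cumulative[-1]
    # k is the value of the first non-zero cumulative entry (0 if no pixels).
    k = next((c for c in cumulative if c > 0), 0)
    if pixel_count == k:
        # Empty image or a single gray level: nothing to redistribute.
        return [row[:] for row in pixels]
    return [
        [(cumulative[level] - k) * 255 // (pixel_count - k) for level in row]
        for row in pixels
    ]


def binarize(pixels, threshold):
    """Pure black (0) below the threshold, pure white (255) otherwise."""
    return [[0 if level < threshold else 255 for level in row] for row in pixels]


# Tiny made-up "page": a grayish background with slightly darker strokes.
page = [[100, 100, 120],
        [120, 140, 160]]
print(binarize(equalize(page), threshold=128))
```

The threshold is passed in as a parameter because, as described above, it depends on the document type. One value would typically be chosen per book and reused for every page.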

