Hello again Sriranga,

A separate email for another subject, so you can get better results from your recognition. This should also help newcomers who are wondering how to "optimize" a picture before feeding it to Tess.

I've learned empirically that recognition results depend heavily on the quality of the image you feed the program. I have written image processing filters myself, based on various existing algorithms (for example cumulative histogram -> equalization, so that a non-perfect white background and non-perfect black characters are automatically stretched to pure white and pure black). I've also learned that the parameters I give to those routines need to be determined per document type (if one page of a book has a certain background, it's most likely the same for the whole book, so all of its pages share the same "type").

Since I'm working for a corporation which doesn't allow me to share my work unless it goes into the Tesseract source code itself, I'm not really allowed to give you source code, but here are directions, in case someone else can help you code it:

- Build a histogram (a regular one). It "counts" the pixels at each gray level. Given that there are 256 possible values (0 to 255), a histogram is just an array of 256 elements, all set to zero at the beginning of the function. At the end of the function, the sum of all 256 elements is the pixel count of your picture (since every pixel has been "distributed" into one of the histogram elements).
- Based on that, make a cumulative histogram. Take the first array and, instead of storing each gray level's own pixel count in each element, store the running sum. So if element 0 counts 30 pixels and gray level 1 has 20 pixels, your cumulative histogram has to hold 50 in element 1. Then if gray level 2 has 5 pixels, element 2 will be 55, and so on until you reach element 255 (white).
- Equalize your image with the cumulative histogram like this: let's say k is the first non-zero value of your cumulative histogram (so if elements 0, 1 and 2 are 0 and element 3 is 10, then k = 10) and pc is your pixel count (in most cases image width * image height). Then loop over each pixel, compute cdf1 = cumulativeHistogram[currentPixelLevel] - k and cdf2 = pc - k; the pixel's new value is cdf1 / cdf2 * 255, rounded to the nearest integer. You'll get a nicely "contrasted" picture with evenly redistributed gray levels.
- Now transform it to clean black and white (do not confuse b&w with grayscale, they are different, and Tesseract prefers b&w). To do this you just have to determine the gray level threshold: a pixel below it is considered black, a pixel above it white.
  This threshold is type-specific, which is what I was talking about earlier (within a book, you will most likely use the same threshold for all the pages, since the gray-level distribution will be the same on the same background).
- Sometimes the threshold has to be different over different areas of the page. In that case I recommend building the cumulative histogram per cell of a grid. But be careful not to "split" characters between cells when doing that.

Voila :)

Pierre.
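
For anyone who wants to experiment with the steps listed above, here is a minimal sketch in plain Python. It is not the code referred to in this mail, only an illustration written from the description. It assumes an 8-bit grayscale image held as a list of rows of integers in 0..255; the function names (histogram, cumulative, equalize, binarize) and the threshold of 128 in the example are made up for illustration.

```python
def histogram(pixels):
    """Count how many pixels have each gray level (0..255)."""
    hist = [0] * 256
    for row in pixels:
        for level in row:
            hist[level] += 1
    return hist


def cumulative(hist):
    """Running sum: result[i] is the number of pixels with level <= i."""
    result, total = [], 0
    for count in hist:
        total += count
        result.append(total)
    return result


def equalize(pixels):
    """Spread the gray levels of the image over the full 0..255 range."""
    cdf = cumulative(histogram(pixels))
    pixel_count = cdf[-1]
    # k in the mail: the first non-zero value of the cumulative histogram.
    k = next((c for c in cdf if c > 0), 0)
    if pixel_count == k:
        # Empty image or a single gray level: nothing to redistribute.
        return [row[:] for row in pixels]
    scale = 255 / (pixel_count - k)
    return [[round((cdf[level] - k) * scale) for level in row]
            for row in pixels]


def binarize(pixels, threshold):
    """Levels below the threshold become black (0), the rest white (255)."""
    return [[0 if level < threshold else 255 for level in row]
            for row in pixels]


# Tiny 2x2 example "page" with a gray background and gray ink.
page = [[10, 20], [30, 40]]
equalized = equalize(page)          # [[0, 85], [170, 255]]
black_white = binarize(equalized, 128)  # 128 is only a placeholder threshold
print(equalized, black_white)
```

The threshold still has to be chosen per document type, as described above. For the grid variant, the same functions could be applied to each cell of the page separately, with the same caveat about not cutting characters in two.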

