Re: Generating / Training box files for Kannada.

MARTIN Pierre Tue, 13 Apr 2010 02:01:14 -0700

Hello again Sriranga,

A separate email for another subject, so you can get better results on your 
recognition. This should also help newcomers who are wondering how to 
"optimize" a picture before feeding it to Tess.
I've learned empirically that a lot of the recognition results depend on the 
"quality" of the image you feed the program.
I've written my own image processing filters based on various existing 
algorithms (cumulative histogram -> equalization, so that a non-perfect white 
background and non-perfect black characters are automatically pushed to pure 
white and black) and similar things.
I've also learned that the parameters I have to give to those routines need 
to be determined per document type (if a book's page has a certain background, 
it's most likely the same for the whole book, so all pages share the same 
"type").


Since I'm working for a corporation which doesn't allow me to share my work 
unless it's part of the Tesseract source code, I'm not really allowed to give 
you source code, but here are some directions, in case someone else can help 
you code:

- Build a histogram (a regular one). Typically it "counts" pixels for each gray 
level. Given that there are 256 possible values (0 to 255), a histogram is just 
an array of 256 elements, all set to zero at the beginning of the function. At 
the end of the function, if you sum all 256 elements, you'll get the pixel 
count of your picture (since every pixel has been distributed into one of the 
histogram elements).
- Based on that, make a cumulative histogram. Basically, you take the first 
array, and instead of storing the pixel count of a single gray level in each 
element, you store the running sum (so if element 0 has 30 pixels and gray 
level 1 has 20 pixels, then your cumulative histogram holds 50 in element 1. 
Then if gray level 2 has 5 pixels, element 2 will be 55, and so on until you 
reach element 255 (white)).
- Equalize your image with the cumulative histogram like this: let's say k is 
the first non-zero value of your cumulative histogram (so if elements 0, 1 and 
2 are 0, but element 3 is 10, then k = 10), and pc is your pixel count (in most 
cases image width * image height). Then loop over each pixel, and compute 
cdf1 = cumulativeHistogram[currentPixelLevel] - k and cdf2 = pc - k; your 
current pixel's new value should then be cdf1 / cdf2 * 255, rounded to an 
integer. You'll have a very nice "contrasted" picture with evenly redistributed 
gray levels.
- Now, transform it to a clean black and white image (do not confuse b&w with 
grayscale, they are different; Tesseract prefers b&w). To do this you just have 
to determine the gray level threshold: pixels below it are considered black, 
and pixels above it white. This threshold is type-specific, which is what I 
was talking about earlier (in a book, you will most likely use the same 
threshold for all pages, since the gray level distribution will be the same on 
the same background).
- Sometimes the threshold will have to be different over different areas... I 
recommend making a grid of cumulative histograms (one per area)... But be 
careful not to "split" characters when doing that.

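As a rough illustration of the first four steps above (this is not the code 
referred to in this message), a minimal Python sketch might look like the 
following. It assumes an 8-bit grayscale image stored as a flat list of 
integers from 0 to 255; the function names are made up for illustration, k is 
taken to be the first non-zero value of the cumulative histogram, and the new 
pixel value is computed as cdf1 / cdf2 * 255.

```python
def histogram(pixels):
    """Count how many pixels have each gray level (0..255)."""
    counts = [0] * 256
    for value in pixels:
        counts[value] += 1
    return counts


def cumulative_histogram(counts):
    """Running sum: element i is the number of pixels with level <= i."""
    total = 0
    cumulative = []
    for count in counts:
        total += count
        cumulative.append(total)
    return cumulative


def equalize(pixels):
    """Spread the gray levels of the image over the full 0..255 range."""
    if not pixels:
        return []
    cumulative = cumulative_histogram(histogram(pixels))
    pc = len(pixels)  # pixel count (width * height)
    # Assumption: k is the first non-zero VALUE of the cumulative histogram.
    k = next(c for c in cumulative if c > 0)
    if pc == k:
        # Only one gray level in the image: nothing to redistribute.
        return list(pixels)
    return [round((cumulative[v] - k) * 255 / (pc - k)) for v in pixels]


def threshold(pixels, level):
    """Pure black (0) below `level`, pure white (255) at or above it."""
    return [0 if v < level else 255 for v in pixels]


# Tiny hypothetical "image" of six pixels.
sample = [10, 10, 10, 60, 120, 250]
equalized = equalize(sample)
print(equalized)                  # -> [0, 0, 0, 85, 170, 255]
print(threshold(equalized, 128))  # -> [0, 0, 0, 0, 255, 255]
```

The threshold level (128 here) is only a placeholder; as noted above, it has to 
be chosen per document type. For the per-area variant, the same functions could 
presumably be applied to the pixels of each grid cell separately.
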
Voila :)
Pierre.

