Tesseract - Line Finding Algorithm

Sai Nikhil Sat, 10 Aug 2013 04:28:50 -0700

Hi All,

This is Sai. I want to develop an Android app that detects the text 
contained in a screenshot. I first pre-process the image to remove edges 
and whatever other unwanted artifacts I can. I also adjust the DPI of the 
image to what Tesseract requires, enlarging the image with an 
interpolation method such as bicubic or B-spline. I then want to pass the 
cleaned image to Tesseract-OCR, which I found to be a good open-source OCR 
engine.

This is where I am stuck: Tesseract does not segment the page the way I 
need. From what I have read, after grayscaling and thresholding the image, 
Tesseract assumes it is looking at a block of text on a page and applies 
its internal line-finding algorithm to fit the text between a baseline and 
a mean line. I don't think this works in my situation, because the text 
may be laid out like this (http://tsndiffopera.in/problem.jpg), where the 
two blocks of text need different baseline and mean-line fits. Tesseract 
fits both blocks to the same line, and as a result the 'i' is often 
detected as 'l'. I have many such failure cases.

My question is: is there any way to overcome this? For example, can I 
change the segmentation algorithm used by Tesseract, or implement my own 
segmentation algorithm that divides the page into blocks of text and 
identifies each word, assuming any two words are separated by some minimum 
distance (say, one space) either horizontally or vertically? Does anyone 
have resources, such as related research papers, for achieving this? If 
someone has overcome this situation before, please tell me how.



Thanks,
Sai ( www.tsndiffopera.in )

-- 
You received this message because you are subscribed to the Google
Groups "tesseract-ocr" group.
To post to this group, send email to [email protected]
To unsubscribe from this group, send email to
[email protected]
For more options, visit this group at
http://groups.google.com/group/tesseract-ocr?hl=en

