---------- Forwarded message ----------
From: Debayan Banerjee <[EMAIL PROTECTED]>
Date: 2008/10/15
Subject: Indic OCR workout at foss.in
To: [EMAIL PROTECTED], [EMAIL PROTECTED],
[EMAIL PROTECTED]


Dear listmates,
There is a proposal for Indic OCR (TesseractIndic) workout at foss.in 2008
from IndLinux. So here I start the thread that will gather people interested
in this.
All my work is documented in detail on
http://debayanin.googlepages.com/hackingtesseract . The latest entry is
specifically for people who want to join the effort. Please go through and
comment:
*TesseractIndic @ foss.in 2008*

Note: TesseractIndic is Tesseract-OCR
<http://code.google.com/p/tesseract-ocr/> with Indic script support. It will
remain a separate project until Tesseract-OCR actually decides to accept
patches and merge Indic script support. TesseractIndic can be found here:
<http://code.google.com/p/tesseractindic/>.


So let's see where we stand. We have Tesseract-OCR, which works great for
English. As a proof of concept, I managed to apply "maatraa clipping"
(pseudocode: <http://tesseractindic.googlecode.com/files/clipmatra_pseudocode.pdf>,
related background: <http://sites.google.com/site/ocropus/morphological-operations>)
(which is, I think, a new term/approach in the world of OCR!) to the image
being fed to the Tesseract-OCR engine. Accuracy obtained by this method,
along with some really crappy training, stands at about 85%.

A standard OCR process contains the following steps:

(1) Pre-processing, involving skew removal, etc. Pretty much
   language-independent, though features like the shirorekha
   might help here.
(2) Character extraction: Again, largely language-independent,
   though language dependency might come in because of
   features like shirorekha.
(3) Character identification: Language independent, maybe with
   specialised plugins to take advantage of language features,
   or items like known fonts.
(4) Post-processing, which involves things like spell-checking to
   improve accuracy.

The currently available version of Tesseract-OCR does steps 3 and 4 above
for any language. But it can only do that if it can do step 2 properly,
which it can't for connected scripts like Hindi, Bengali, etc. So the
approach is to take the scanned image, apply some pre-processing to it, do
the "maatraa clipping" operation on it, and then feed the resulting image to
the Tesseract-OCR engine.

In detail, the things to do are:

(1) Pre-processing: Skew removal
<http://tesseractindic.googlecode.com/files/skew_deskew.pdf>, noise removal.
Skew removal in particular is key for the "maatraa clipping" code to work.

(2) "Maatraa clipping": This enables the Tesseract-OCR engine to treat
connected Devanagari-style script like any other script.

(3) Training <http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract>:
Very important for getting good results, but well documented, and good tools
exist for training Tesseract-OCR.

(4) Web Interface: We need to create a web interface so people can freely
OCR their documents online. No big deal.
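To show how small such an interface could be, here is a minimal sketch using
only Python's standard WSGI machinery; `run_ocr` is a hypothetical stub
standing in for the actual TesseractIndic pipeline, and the form/field names
are my own, not from any existing code:

```python
from wsgiref.simple_server import make_server

# A bare upload form served on GET requests.
FORM = (b"<html><body>"
        b'<form method="post" enctype="multipart/form-data" action="/">'
        b'<input type="file" name="page"/>'
        b'<input type="submit" value="OCR"/>'
        b"</form></body></html>")

def run_ocr(image_bytes):
    # Hypothetical stub: the real service would save the uploaded scan
    # and invoke the TesseractIndic pipeline on it.
    return b"(recognised text goes here)"

def app(environ, start_response):
    if environ["REQUEST_METHOD"] == "POST":
        size = int(environ.get("CONTENT_LENGTH") or 0)
        page = environ["wsgi.input"].read(size)
        start_response("200 OK", [("Content-Type", "text/plain")])
        return [run_ocr(page)]
    start_response("200 OK", [("Content-Type", "text/html")])
    return [FORM]

# To serve it: make_server("", 8000, app).serve_forever()
```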

Now my intention is to implement skew removal using Hough transforms
(<http://yaniv.leviathanonline.com/blog/math/understanding-soccer/>,
<http://danthorpe.me.uk/blog/2005/02/24/Implementing_the_Hough_Transform>).
Hough transforms are really good at finding straight lines (among other
shapes) in images. So all we need to do is find the "maatraas" and calculate
their slope. That gives us the skew angle, and we just rotate the page to
correct the skew.
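As a sketch of the idea (a Hough-style vote over near-horizontal angles,
assuming a binary NumPy image with ink pixels set to 1; the function name
and tolerances are mine, not from the TesseractIndic code):

```python
import numpy as np

def estimate_skew(binary, angle_range=5.0, step=0.1):
    """Estimate page skew in degrees with a simple Hough-style vote.

    For each candidate angle we project every ink pixel onto the family
    of lines tilted by that angle; the true skew is the angle whose
    strongest line (the maatraa) collects the most votes.
    """
    ys, xs = np.nonzero(binary)
    best_angle, best_votes = 0.0, -1
    for a in np.arange(-angle_range, angle_range + step, step):
        # Offset of each pixel from the line family y = x*tan(a) + c.
        r = ys - xs * np.tan(np.deg2rad(a))
        votes = np.bincount(np.round(r - r.min()).astype(int))
        if votes.max() > best_votes:
            best_votes, best_angle = votes.max(), a
    return best_angle
```

Once the angle is known, something like PIL's `Image.rotate()` with the
negated angle undoes the skew.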

I had implemented "maatraa clipping" using projection-based methods. It
seems there is a better digital image processing technique for this, called
"morphological operations". Actually, I am not that sure about it yet; still
researching and trying things out.
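For reference, the projection-based version is only a few lines. This is a
minimal sketch (the function name and band width are my own, assuming a
binary NumPy image with ink pixels set to 1):

```python
import numpy as np

def clip_maatraa(binary, band=2):
    """Remove the maatraa (headline) from a deskewed word image so the
    characters hanging below it become disconnected components.

    The maatraa shows up as the row(s) with the highest horizontal
    projection (ink count per row); we blank a small band around that
    peak.
    """
    profile = binary.sum(axis=1)      # ink pixels in each row
    peak = int(np.argmax(profile))    # row of the maatraa
    top = max(peak - band, 0)
    bottom = min(peak + band + 1, binary.shape[0])
    clipped = binary.copy()
    clipped[top:bottom, :] = 0        # erase the headline band
    return clipped
```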

Now, I had done all this work in C++, as the Tesseract-OCR code is also in
C++. But of late I have been mesmerised by the simplicity and power of
Python and the Python Imaging Library. All the work I am doing now,
including Hough transforms, is in Python. So now we have 2 options:

(1) Do the pre-processing and "maatraa clipping" in Python and feed the page
to the Tesseract-OCR (will be easy and quicker to implement)

(2) Do the entire thing in C++ (will execute much faster)

Again, we will probably end up doing both. At foss.in I will probably bring
along Python code that already works, and ask people to port it to C++ and
merge it upstream into TesseractIndic. Or we could ask people to implement
algorithms of their choice, in the language of their choice, on a common set
of test images, and then we shall convert that work to C++ and add it.


-- 
BE INTELLIGENT, USE LINUX




_______________________________________________
Users mailing list
[email protected]
http://lists.dgplug.org/listinfo.cgi/users-dgplug.org
