[tesseract-ocr] Symbols and words segmentation with their location and line numbers from text image having C++ code

Pramit Mazumdar Mon, 06 Jul 2015 03:53:09 -0700


I have a C++ program stored as a text image. I need to extract the 
symbols, alphanumerics and tokens from the image, along with their 
dimensions. Here "dimension" means the start row, end row, start column 
and end column, in pixels, within the image. For example, suppose there is 
a text image (in *.png format) containing the following C++ code:


#include<iostream>
using namespace std;

I have to write MATLAB code that reads the above image and generates the 
following dataset:

+-------------+-----------+-----------+---------+--------------+------------+
| Line Number |   Item    | Start_Row | End_Row | Start_Column | End_Column |
+-------------+-----------+-----------+---------+--------------+------------+
|           1 | #         | ---       | ---     | ---          | ---        |
|           1 | include   | ---       | ---     | ---          | ---        |
|           1 | <         | ---       | ---     | ---          | ---        |
|           1 | iostream  | ---       | ---     | ---          | ---        |
|           1 | >         | ---       | ---     | ---          | ---        |
|           2 | using     | ---       | ---     | ---          | ---        |
|           2 | namespace | ---       | ---     | ---          | ---        |
|           2 | std       | ---       | ---     | ---          | ---        |
|           2 | ;         | ---       | ---     | ---          | ---        |
+-------------+-----------+-----------+---------+--------------+------------+

I feel the overall objective can be split into three parts: first, 
segmenting the words in the text image; second, identifying their 
coordinates; third, counting the line numbers.
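The third part, counting line numbers, could be derived from the 
coordinates alone. The following is only a minimal sketch in Python 
(not MATLAB), assuming each item already has a bounding box with rows 
counted from the top of the image. The function name 
assign_line_numbers and the tuple layout are hypothetical choices for 
illustration, not part of Tesseract.

```python
def assign_line_numbers(boxes):
    """Group boxes into text lines and number the lines from 1.

    boxes: list of (item, start_row, end_row, start_col, end_col),
           with rows counted from the top of the image (assumed layout).
    Returns a list of (line_number, item, start_row, end_row,
    start_col, end_col), ordered top to bottom, then left to right.
    """
    lines = []  # each entry: [line_top, line_bottom, [member boxes]]
    # Visit boxes from top to bottom by their vertical centre.
    for box in sorted(boxes, key=lambda b: (b[1] + b[2]) / 2):
        _, top, bottom, _, _ = box
        centre = (top + bottom) / 2
        if lines and lines[-1][0] <= centre <= lines[-1][1]:
            # The centre falls inside the current line's vertical extent.
            current = lines[-1]
            current[0] = min(current[0], top)
            current[1] = max(current[1], bottom)
            current[2].append(box)
        else:
            # Otherwise start a new text line.
            lines.append([top, bottom, [box]])

    result = []
    for number, (_, _, members) in enumerate(lines, start=1):
        # Within a line, order the items by their start column.
        for item, top, bottom, left, right in sorted(members, key=lambda b: b[3]):
            result.append((number, item, top, bottom, left, right))
    return result
```

This simple rule assumes the text lines do not overlap vertically and 
that the page is not skewed; a rotated scan would need deskewing first.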

For the first two objectives I have used Tesseract-OCR, which identifies 
words as well as their respective coordinates. Below is the way I am 
extracting the words and their coordinates. [I have manually converted 
the image from PNG to TIF format, as described in the Tesseract manual.]

<Path to Tesseract-OCR folder>\tesseract.exe "image.tif" output          \*extracts words*\
<Path to Tesseract-OCR folder>\tesseract.exe "image.tif" output makebox  \*extracts word dimensions*\

As output, I get the extracted words in a text file named output.txt. 
However, the *makebox* command finds the coordinates of each single 
character in the image, whereas I need the coordinates of each single 
word (in this case, symbols and tokens separately).
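One possible workaround is to merge the per-character boxes into word 
boxes after the fact. The sketch below is in Python (not MATLAB) and 
assumes the usual box-file layout of one character per line, 
"<char> <left> <bottom> <right> <top> <page>", with y measured from the 
bottom edge of the image. The helper names read_char_boxes and 
group_into_words, and the max_gap threshold, are hypothetical and 
would need tuning for a real font size.

```python
def read_char_boxes(box_text, image_height):
    """Parse box-file text into (symbol, start_row, end_row, start_col, end_col).

    Rows are flipped so that they count from the top of the image.
    """
    chars = []
    for line in box_text.splitlines():
        fields = line.split()
        if len(fields) < 5:
            continue  # skip blank or malformed lines
        symbol = fields[0]
        left, bottom, right, top = (int(v) for v in fields[1:5])
        chars.append((symbol, image_height - top, image_height - bottom, left, right))
    return chars


def group_into_words(chars, max_gap):
    """Merge consecutive characters into words.

    Characters are assumed to arrive in reading order. Two characters
    belong to the same word when they overlap vertically and the
    horizontal gap between them is at most max_gap pixels (an assumed,
    font-dependent threshold).
    """
    words = []
    for symbol, top, bottom, left, right in chars:
        if words:
            text, w_top, w_bottom, w_left, w_right = words[-1]
            same_line = top <= w_bottom and bottom >= w_top
            if same_line and left - w_right <= max_gap:
                # Extend the current word's text and bounding box.
                words[-1] = (
                    text + symbol,
                    min(w_top, top),
                    max(w_bottom, bottom),
                    w_left,
                    max(w_right, right),
                )
                continue
        words.append((symbol, top, bottom, left, right))
    return words
```

Because a box file has no entries for spaces, the gap threshold is the 
only cue for word boundaries here, so tightly set code may merge 
neighbouring words.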

*So, my question is: how can I generate a text file that shows the 
coordinates of each symbol, alphanumeric and token separately, instead of 
each character?*

Is there any option in Tesseract that can extract each word and its 
coordinates directly from the image file, instead of each character? I 
am not sure whether I would need a lexical analyzer for this. If yes, 
how could I use it along with Tesseract?
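To illustrate what a lexical step might look like, here is a rough 
Python sketch (not MATLAB) that splits one recognized word such as 
"#include<iostream>" into tokens and takes each token's box as the 
union of its characters' boxes. It assumes per-character boxes are 
already available as (symbol, start_row, end_row, start_col, end_col) 
with exactly one character per symbol. The pattern and the name 
split_tokens are hypothetical, and the pattern is far simpler than a 
real C++ lexer.

```python
import re

# Identifiers/keywords, simple numbers, or any single other symbol.
# This is a deliberately small approximation of C++ lexical rules.
TOKEN_PATTERN = re.compile(r"[A-Za-z_]\w*|\d+(?:\.\d+)?|\S")


def split_tokens(chars):
    """Split one OCR word into tokens with bounding boxes.

    chars: list of (symbol, start_row, end_row, start_col, end_col)
           in reading order, one character per entry (assumed).
    Returns a list of (token, start_row, end_row, start_col, end_col).
    """
    text = "".join(c[0] for c in chars)
    tokens = []
    for match in TOKEN_PATTERN.finditer(text):
        # Characters covered by this token, by position in the word.
        members = chars[match.start():match.end()]
        tokens.append((
            match.group(),
            min(c[1] for c in members),
            max(c[2] for c in members),
            min(c[3] for c in members),
            max(c[4] for c in members),
        ))
    return tokens
```

With this approach the tokenizer only decides where to cut; the pixel 
coordinates still come entirely from the OCR character boxes.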

This is how I have approached the problem. If there is a simpler way to 
accomplish the goal, please share it with me. Thank you.
