I have a C++ program stored as a text image. I need to extract the symbols,
alphanumerics and tokens from the image, along with their dimensions. Here,
dimension means the start row, end row, start column and end column of each
item, in pixels. For example, suppose a text image (in *.png format)
contains the following C++ code:

#include<iostream>
using namespace std;

I have to write MATLAB code that reads the above image and generates the
following dataset:
+-------------+-----------+-----------+---------+--------------+------------+
| Line Number | Item      | Start_Row | End_Row | Start_Column | End_Column |
+-------------+-----------+-----------+---------+--------------+------------+
| 1           | #         | ---       | ---     | ---          | ---        |
| 1           | include   | ---       | ---     | ---          | ---        |
| 1           | <         | ---       | ---     | ---          | ---        |
| 1           | iostream  | ---       | ---     | ---          | ---        |
| 1           | >         | ---       | ---     | ---          | ---        |
| 2           | using     | ---       | ---     | ---          | ---        |
| 2           | namespace | ---       | ---     | ---          | ---        |
| 2           | std       | ---       | ---     | ---          | ---        |
| 2           | ;         | ---       | ---     | ---          | ---        |
+-------------+-----------+-----------+---------+--------------+------------+
I think the overall task can be divided into three parts: first, segmenting
the words in the text image; second, identifying their coordinates; third,
counting the line numbers.
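
For the third part, one possible way to count lines is to group boxes whose
row ranges overlap. The following is only a sketch in Python (not MATLAB),
assuming each box is already available as a start row and end row counted
from the top of the image, and that the text is not skewed. The function
name and the dictionary keys are hypothetical.

```python
def assign_line_numbers(boxes):
    """Return a 1-based text line number for every box.

    Assumption: each box is a dict with 'start_row' and 'end_row' in pixels,
    measured from the top of the image, and text lines do not overlap
    vertically (no skew, no touching lines).
    """
    # Visit boxes from top to bottom without losing the original order.
    order = sorted(range(len(boxes)), key=lambda i: boxes[i]["start_row"])
    line_numbers = [0] * len(boxes)
    current_line = 0
    line_bottom = -1  # lowest row reached by the current text line
    for i in order:
        box = boxes[i]
        if box["start_row"] > line_bottom:
            # The box starts below everything seen so far: a new text line.
            current_line += 1
            line_bottom = box["end_row"]
        else:
            # The box overlaps the current line; extend the line if needed.
            line_bottom = max(line_bottom, box["end_row"])
        line_numbers[i] = current_line
    return line_numbers
```

Small items such as `;` only cover part of the line height, but they still
overlap the row range of their neighbours, so they should end up on the same
line number.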
For the first two parts I have used Tesseract-OCR, which identifies words as
well as their respective coordinates. Below is how I am extracting the words
and their coordinates. [I have manually converted the image from PNG to TIF
format, as described in the Tesseract manual.]

<Path to Tesseract-OCR folder>\tesseract.exe "image.tif" output   \* extracts words *\
<Path to Tesseract-OCR folder>\tesseract.exe "image.tif" output makebox   \* extracts word dimensions *\

As output, the words are extracted into a text file named output.txt.
However, the *makebox* command reports the coordinates of every single
character in the image, whereas I need the coordinates of each word (in this
case, symbols and tokens separately).
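
One possible workaround is to merge the per-character boxes into tokens
afterwards. The sketch below is in Python (not MATLAB) and is only an
illustration. It assumes that each line of the .box file has the form
`<character> <left> <bottom> <right> <top> <page>`, with the origin at the
bottom-left corner of the image, so the image height is needed to convert to
rows counted from the top. The function names, the `max_gap` threshold and
the rule "letters, digits and underscore form one token, every other symbol
is a token by itself" are assumptions, not something Tesseract provides.

```python
def parse_box_line(line, image_height):
    """Convert one .box line into a character box with top-down rows."""
    parts = line.split()
    ch = parts[0]
    left, bottom, right, top = (int(v) for v in parts[1:5])
    return {
        "text": ch,
        # Box files count y from the bottom edge; flip to rows from the top.
        "start_row": image_height - top,
        "end_row": image_height - bottom,
        "start_col": left,
        "end_col": right,
    }


def is_word_char(ch):
    """Assumed lexical rule: identifiers and numbers are built from these."""
    return ch.isalnum() or ch == "_"


def merge_into_tokens(char_boxes, max_gap=5):
    """Merge character boxes (in reading order) into token boxes.

    max_gap is a hypothetical threshold in pixels: a wider horizontal gap
    between two characters starts a new token.
    """
    tokens = []
    for box in char_boxes:
        prev = tokens[-1] if tokens else None
        can_extend = (
            prev is not None
            and is_word_char(box["text"])
            and is_word_char(prev["text"][-1])
            # Same text line: the row ranges must overlap.
            and box["start_row"] <= prev["end_row"]
            and box["end_row"] >= prev["start_row"]
            # Directly to the right of the previous character.
            and box["start_col"] >= prev["start_col"]
            and box["start_col"] - prev["end_col"] <= max_gap
        )
        if can_extend:
            prev["text"] += box["text"]
            prev["start_row"] = min(prev["start_row"], box["start_row"])
            prev["end_row"] = max(prev["end_row"], box["end_row"])
            prev["end_col"] = max(prev["end_col"], box["end_col"])
        else:
            tokens.append(dict(box))  # copy, so the input stays untouched
    return tokens
```

The pixel values may be off by one depending on whether inclusive or
exclusive end coordinates are wanted, and MATLAB indices start at 1, so the
result should be checked against a known image.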
*So, my question is: how can I generate a text file that shows the
coordinates of each symbol, alphanumeric and token separately, instead of
each character?*
Is there any option in Tesseract that can extract each word and its
coordinates directly from the image file, instead of each character? I
wonder whether I would need a lexical analyzer for this. If so, how could I
use it along with Tesseract?
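
Depending on the Tesseract version, word-level boxes may be available
without makebox: the `hocr` config writes word bounding boxes into an HTML
file, and newer releases (3.05 and later, as far as I know) also accept a
`tsv` config that writes one row per word with the columns left, top, width
and height, measured from the top-left corner. The Python sketch below (not
MATLAB) reads such TSV text; the function names and dictionary keys are
hypothetical. Tesseract words are separated by whitespace, so
`#include<iostream>` would come back as a single word; the small regular
expression at the end is an assumed stand-in for a lexical analyzer and only
splits the text, not the box.

```python
import csv
import io
import re


def read_word_boxes(tsv_text):
    """Collect word rows (level 5) from Tesseract TSV output."""
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    words = []
    line_ids = {}  # (page, block, paragraph, line) -> running line number
    for row in reader:
        text = (row.get("text") or "").strip()
        if row["level"] != "5" or not text:
            continue  # skip page/block/paragraph/line rows and empty words
        # line_num restarts inside each paragraph, so number the lines
        # globally in the order in which they first appear.
        key = (row["page_num"], row["block_num"], row["par_num"], row["line_num"])
        line_number = line_ids.setdefault(key, len(line_ids) + 1)
        left, top = int(row["left"]), int(row["top"])
        width, height = int(row["width"]), int(row["height"])
        words.append({
            "line": line_number,
            "text": text,
            "start_row": top,
            "end_row": top + height,   # subtract 1 for an inclusive index
            "start_col": left,
            "end_col": left + width,   # subtract 1 for an inclusive index
        })
    return words


def split_tokens(text):
    """Assumed lexical rule: identifiers, numbers, then single symbols."""
    return re.findall(r"[A-Za-z_]\w*|\d+|\S", text)
```

To get a box for each token inside a word, the character boxes would still
be needed, so this approach mainly helps with the line numbers and with
words that are already separated by spaces.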
This is how I have approached the problem. If there is any simpler way to
accomplish the goal, please share it with me. Thank you.