I am cleaning up an image using Leptonica and then passing it to Tesseract
for OCR. However, Tesseract is not able to recognize the characters even
though the image is of high quality. The image specifications are as
follows:
1 bpp, uncompressed, 1280 x 960, 300 dpi horizontal and vertical resolution
These are the image processing operations I carry out, in sequence, using
Leptonica:
1. pixConvertTo8
2. pixBackgroundNormSimple
3. pixOtsuAdaptiveThreshold
4. pixContrastTRC (I am passing high factor values like 1.0 or even 5.0, but
   the image doesn't really change)
5. pixFindSkew
6. pixRotate (rotate by the angle found by pixFindSkew)
7. pixRotate90 (done 4 times to read the image in all 4 orientations)
8. pixClipRectangle (crop the image)
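One thing that may be worth checking about the first three steps: the input is already 1 bpp, so after conversion to 8 bpp it can only contain two gray levels, and thresholding a two-level image should simply give back the same bilevel image. The sketch below illustrates that idea in plain Python. It is not Leptonica code: `otsu_threshold` is a hypothetical helper implementing a textbook global Otsu threshold, whereas pixOtsuAdaptiveThreshold works tile by tile, and the sample bit pattern is made up.

```python
def otsu_threshold(pixels):
    """Return the threshold t (0..255) that maximizes the between-class
    variance of the two groups {p <= t} and {p > t}."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0      # number of pixels with value <= t
    sum0 = 0    # sum of those pixel values
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue            # no pixels in the lower class yet
        w1 = total - w0
        if w1 == 0:
            break               # upper class is empty from here on
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        between_var = w0 * w1 * (mean0 - mean1) ** 2
        if between_var > best_var:
            best_t, best_var = t, between_var
    return best_t


# Made-up 1 bpp data (1 = black text pixel), expanded to 8 bpp gray values.
bits = [1, 0, 0, 1, 1, 0, 1, 0] * 4
gray = [0 if bit else 255 for bit in bits]

threshold = otsu_threshold(gray)
rebinarized = [255 if p > threshold else 0 for p in gray]

print("gray levels present:", sorted(set(gray)))
print("image unchanged by thresholding:", rebinarized == gray)
```

If this holds for the real image as well, the conversion and thresholding steps add no information to an input that is already binary, so the recognition problem would have to come from somewhere else in the pipeline.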
Finally, I run the tesseract command.
I get garbage characters in the output. A sample input image is as follows:
<https://lh4.googleusercontent.com/-kG9mHG4xOVQ/U9DQ7tIxibI/AAAAAAAABME/G88fZRRRCgU/s1600/90_cropped.tif>
The output that I get is as follows:
Final K-1
II]
s h d | K-1 ,.,
(F°o.~?n‘i&1) 5/>.©12 mm E2‘;
Deparlrnenl of tho Treasury , ,
I 1 I l I
‘mama, Ravenuo SGMW For cnlundm your 201), ‘ " °F°$ "'100fTIO
or lax yum boqmnnnq 7 _ 20\Q_
‘ 7660
and ondmg _ W vv I go
Beneﬁciary's Share of Income, Deductions,cl'editS, etc. F 800 buck 01 loam nnd
lnstruoflons»
___lnformatI0n About mo Estate or Trust
‘ Ordmary d|v|dmi 12113
_
‘; Quahfmd dlVIdG
\ 8132
3 1
Net shun-term
A Estate's at trust's omgiuym ldonnlmnluon numbol
56-0987654
B Estate's u trust‘: namo
ESTATE OF MARTHA SMITH
0 Fiduc§ary's name, address, clly, smlu‘ and /IP codo
N01 long~lerm c
\ 24043 u
‘ 28% vale gann
Ti
Unreptumd 5
Omar porfloho 4nonbuslness lfll
/\..4........ L. ._.._ ,.
What should I do to improve the accuracy?
Part 2:
I tried to follow this link:
<http://tesseract-ocr.googlecode.com/svn-history/trunk/doc/tesseract.1.html#_config_files_and_augmenting_with_user_data>
I created an eng.user-words.traineddata file and a bazaar.train file, and
tried to run with "bazaar" as an additional parameter, but I get a
"read_params_file: can't open bazaar" error. Any suggestions?
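For comparison, the layout described in the linked man page section appears to be a config file named exactly "bazaar" (no extension) inside tessdata/configs, plus a plain-text word list named eng.user-words directly in tessdata, rather than files ending in .traineddata or .train. The sketch below builds that layout in a temporary directory so it can be inspected safely. The directory, the sample words, and the input.tif/output names are placeholders, and the config variable names are as I recall them from that man page, so they should be checked against the installed Tesseract version.

```python
import tempfile
from pathlib import Path

# Placeholder for the real tessdata directory of the Tesseract install.
tessdata = Path(tempfile.mkdtemp()) / "tessdata"
configs = tessdata / "configs"
configs.mkdir(parents=True)

# Config file: the name given on the command line ("bazaar") is looked up
# as a file, so it carries no extension.
(configs / "bazaar").write_text(
    "load_system_dawg F\n"
    "load_freq_dawg F\n"
    "user_words_suffix user-words\n"
)

# Word list: <language code>.<suffix named in the config>, one word per line.
# These sample words are made up for illustration.
(tessdata / "eng.user-words").write_text("Beneficiary\nFiduciary\nEstate\n")

# Command line with the config name as the last argument.
command = ["tesseract", "input.tif", "output", "-l", "eng", "bazaar"]

for path in sorted(tessdata.rglob("*")):
    print(path.relative_to(tessdata.parent))
print(" ".join(command))
```

With the real tessdata directory in place of the temporary one, the "can't open bazaar" message would be expected whenever no file called bazaar can be found at that location.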