I have a PDF file containing some tabular data.

http://dl.dropbox.com/u/44235928/sample_rotate-0.pdf

I have to extract the tabular data from it. I am converting the PDF into 
an image using the ImageMagick convert utility and then processing that 
image:

convert -rotate 90 -geometry 10000 -depth 8 -density 800 sample.pdf img_800_10000.tif;


Since my PDF file contains only letters, numbers and a few punctuation 
marks, I have created a config file named letters to whitelist those 
characters (and avoid junk characters):

tessedit_char_whitelist 0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz,./-()$@_#
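
The config file can also be generated from a script, which guarantees 
that the key and the character list end up on a single line. This is only 
a sketch: the function names are made up for illustration, and the file 
name letters is the one I use above.

```python
import string
from pathlib import Path

# Punctuation that appears in my table, taken from the whitelist above.
EXTRA_CHARS = ",./-()$@_#"


def whitelist_chars() -> str:
    """Return digits, upper-case letters, lower-case letters, then the extras."""
    return string.digits + string.ascii_uppercase + string.ascii_lowercase + EXTRA_CHARS


def write_whitelist_config(path: str = "letters") -> str:
    """Write a one-line tesseract config file and return the line written.

    The function name and default path are illustrative, not part of tesseract.
    """
    line = f"tessedit_char_whitelist {whitelist_chars()}\n"
    Path(path).write_text(line, encoding="ascii")
    return line
```

Calling write_whitelist_config() in the working directory creates the 
letters file with the same whitelist as above.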

I am running tesseract as -

tesseract img_800_10000.tif img_800_10000.tif nobatch letters;
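
Put together as a script, the two steps look roughly like this. It is a 
sketch in Python using only subprocess; the function names are mine, and 
it assumes that convert and tesseract are on the PATH and that tesseract 
can find the nobatch and letters config files.

```python
import subprocess
from typing import List


def convert_cmd(pdf: str, tif: str, density: int = 800, width: int = 10000) -> List[str]:
    """Build the same ImageMagick command as above, with the options in the same order."""
    return [
        "convert",
        "-rotate", "90",
        "-geometry", str(width),
        "-depth", "8",
        "-density", str(density),
        pdf,
        tif,
    ]


def tesseract_cmd(image: str, output_base: str, config: str = "letters") -> List[str]:
    """Build the tesseract call; tesseract appends .txt to output_base."""
    return ["tesseract", image, output_base, "nobatch", config]


def pdf_to_text(pdf: str, tif: str = "img_800_10000.tif") -> str:
    """Run both steps and return the name of the text file tesseract writes."""
    subprocess.run(convert_cmd(pdf, tif), check=True)
    subprocess.run(tesseract_cmd(tif, tif), check=True)
    return tif + ".txt"
```

With my file names, pdf_to_text("sample.pdf") would leave the recognized 
text in img_800_10000.tif.txt.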


This way I get approximately 80% of the data correct.


I repeated the process, this time creating the image manually. I opened 
the PDF file, zoomed in, took a screenshot, cropped about 15-20 rows of 
the table and then processed the cropped image with tesseract. I got 100% 
accuracy.

This suggests that something is going wrong when the image is created 
from the PDF file with the ImageMagick convert utility. The PDF file 
itself seems to be of very good quality, because the fonts stay crisp even 
at a high zoom level.

Can tesseract-ocr read directly from a PDF instead of TIFF images?

If not, how can I create a good-quality image from the PDF to feed to 
tesseract for better accuracy? (I am able to create it manually, but I 
would prefer to do it through a script.)

Please suggest the parameter values (density, geometry, depth, monochrome, 
etc.) for convert.
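
To compare whatever values are suggested, I could generate one image per 
combination and run tesseract on each. Below is a sketch of how I would 
build those commands. The candidate densities and the -monochrome toggle 
are placeholder values picked for illustration, not recommendations, and 
the function name is made up. Here -density is placed before the input 
file so that it applies when the PDF is rasterized.

```python
from itertools import product
from typing import Iterator, List, Tuple

# Placeholder values to try; these are not recommendations.
DENSITIES = (300, 400, 600)
MONOCHROME = (False, True)


def sweep_commands(pdf: str) -> Iterator[Tuple[str, List[str]]]:
    """Yield (output_name, convert_command) for every parameter combination."""
    for density, mono in product(DENSITIES, MONOCHROME):
        name = f"img_{density}_{'mono' if mono else 'color'}.tif"
        # -density goes before the input; the other options follow it.
        cmd = ["convert", "-density", str(density), pdf, "-rotate", "90", "-depth", "8"]
        if mono:
            cmd.append("-monochrome")
        cmd.append(name)
        yield name, cmd


if __name__ == "__main__":
    # Print the commands instead of running them, so they can be inspected first.
    for name, cmd in sweep_commands("sample.pdf"):
        print(" ".join(cmd))
```

Each resulting image could then be passed to tesseract with the letters 
config and the outputs compared against the table.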


Thanks
Piyush