I am trying to OCR documents that we receive over FTP.  The documents are 
PDF files that contain images.  We process the PDF, extracting each page as 
a TIFF (CCITT T.6) file that is 2509x3530 pixels, 300 dpi, 1-bit depth.  

As accuracy is not the best, I am trying to better understand how to 
train Tesseract.  As a first step, I was wondering which fonts were used to 
generate eng.traineddata?  I have unpacked eng.traineddata using 
"combine_tessdata -u" and extracted the word list using dawg2wordlist, and 
am now trying to understand what the various unpacked components are and how 
they are used.  Is there a description available?  

I was also wondering how one might improve processing speed.  On an 
i7-4800MQ @ 2.7GHz I am getting approximately 6 pages per minute (PPM) 
using 1 thread with Tess4J 3.0.0.  
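
To make the speed question concrete: since each page is already a separate 
TIFF, one option I am considering is to OCR several pages at once on a 
thread pool.  The following is only a rough sketch of that idea, not 
something I have measured.  The class name ParallelPageOcr, the method 
ocrAll and the ocrPage function are hypothetical names made up for 
illustration; ocrPage stands in for whatever call actually performs the OCR 
(in Tess4J, presumably a wrapper around doOCR(File)).  I am assuming each 
worker thread would need its own engine instance, which I have not 
confirmed.  

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical helper: OCRs independent page images concurrently.
class ParallelPageOcr {

    // Applies ocrPage to every page on a fixed-size thread pool and returns
    // the recognised text in the same order as the input pages.
    static List<String> ocrAll(List<Path> pages,
                               Function<Path, String> ocrPage,
                               int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            // Submit one task per page; pages do not depend on each other.
            List<Future<String>> futures = new ArrayList<>();
            for (Path page : pages) {
                Callable<String> task = () -> ocrPage.apply(page);
                futures.add(pool.submit(task));
            }
            // Collect results in submission order, blocking until each is done.
            List<String> texts = new ArrayList<>();
            for (Future<String> future : futures) {
                texts.add(future.get());
            }
            return texts;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        List<Path> pages = Arrays.asList(
                Paths.get("page-1.tif"), Paths.get("page-2.tif"));
        // Placeholder for the real OCR call (assumption: in Tess4J this
        // would wrap doOCR(File) on an engine owned by the worker thread).
        Function<Path, String> placeholderOcr = p -> "text of " + p.getFileName();
        for (String text : ocrAll(pages, placeholderOcr, 4)) {
            System.out.println(text);
        }
    }
}
```

Whether this actually scales with the number of cores is part of what I am 
asking; any per-page speed-up would presumably have to come from Tesseract 
settings or image preprocessing instead.  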

Thanks

- viraf
