I am trying to OCR documents that we receive over FTP. The documents are PDF files that contain images. We process each PDF, extracting every page as a TIFF (CCITT T.6) file that is 2509x3530 pixels, 300 dpi, 1-bit depth.
As accuracy is not the best, I would like to better understand how to train Tesseract. As a first step, I was wondering which fonts were used to generate eng.traineddata? I have unpacked eng.traineddata using "combine_tessdata -u" and extracted the word list using dawg2wordlist, and am now trying to understand what the various artifacts are and how they are used. Is there a description available?

I was also wondering how one might improve processing speed. On an i7-4800MQ @ 2.7GHz I was getting approximately 6 PPM (pages per minute) using 1 thread with Tess4J 3.0.0.

Thanks - viraf
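For anyone wanting to reproduce the unpacking step mentioned above, here is a rough Python sketch that builds the two command lines. It assumes that combine_tessdata and dawg2wordlist are on the PATH, that the unpacked components are written as <prefix>unicharset and <prefix>word-dawg (the component names may differ between Tesseract versions), and that dawg2wordlist takes its arguments in the order unicharset, dawg file, output word list. The function names and the output directory are made up for illustration.

```python
import shutil
import subprocess
from pathlib import Path


def unpack_commands(traineddata, out_dir):
    """Build the command lines to unpack a traineddata file and dump its word list.

    Nothing is executed here; the commands are only assembled.
    """
    traineddata = Path(traineddata)
    out_dir = Path(out_dir)
    lang = traineddata.stem                  # "eng" for eng.traineddata
    # combine_tessdata -u appends each component name to this prefix,
    # e.g. out/eng.unicharset, out/eng.word-dawg (assumed names).
    prefix = str(out_dir / lang) + "."
    wordlist = str(out_dir / (lang + ".wordlist.txt"))  # hypothetical output name
    return [
        ["combine_tessdata", "-u", str(traineddata), prefix],
        ["dawg2wordlist", prefix + "unicharset", prefix + "word-dawg", wordlist],
    ]


def run_commands(commands):
    """Run each command in order, stopping at the first failure."""
    for cmd in commands:
        if shutil.which(cmd[0]) is None:
            raise FileNotFoundError(cmd[0] + " not found on PATH")
        subprocess.run(cmd, check=True)


# Example (requires the Tesseract training tools to be installed,
# and the output directory to exist):
#   run_commands(unpack_commands("eng.traineddata", "unpacked"))
```

Keeping command construction separate from execution makes it easy to print the commands and check them by hand before running anything.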

