On Sunday, February 14, 2016 at 11:15:12 AM UTC-5, viraf wrote:
>
> *Speed*
> On an intel i7-4800 MQ @ 2.7GHz I am getting approximately 6 PPM using 1
> thread. I was looking for suggestions on how to speed up page processing.
> I use parallelStream to process each page in a separate thread,
>
You don't say what resolution or image format, what language(s), or what version of Tesseract you're using, all of which are pretty critical when discussing performance.

Having said that, I just ran a 110 page document in 272 seconds on a recent MacBook Pro. There were ~100 pages of mixed density text totalling 160k characters, in bitonal CCITT G4 fax images of 2550x3300 pixels. That works out to about 24 PPM, roughly four times the speed you quote, so I suspect you're reinitializing Tesseract for every page, or taking a big hit on image processing, or something else unrelated to the core OCR engine.

>
> *Training*
> I am trying to learn about training Tesseract for improved accuracy.
> Given that the fonts / box files used to generate eng.traineddata are not
> available can one specify the fonts used for english?
>

The font list is included in the eng.inttemp file that you extracted. Given that it's something like 350 fonts, you'd have to be looking at a pretty exotic font to need to retrain for that reason.

> Also, is there a description of the various training artifacts ? I used
> "combine_tessdata -u" to unpack eng.traineddata and "dawg2wordlist" to
> extract the wordlist, however was looking for documentation to better
> understand the various training artifacts.
>

Have you reviewed the training documentation on the wiki?
https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract

Tom
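
As a footnote to the reinitialization point above, here is a minimal Java sketch of keeping one engine per worker thread instead of creating one per page. Everything in it is an assumption made for illustration: `FakeOcrEngine` is a hypothetical stand-in, not a Tesseract or wrapper-library API, and the fixed thread pool replaces `parallelStream` only so that the number of workers is explicit.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

class EngineReuseSketch {

    // Hypothetical stand-in for an OCR engine wrapper; not a real Tesseract API.
    static final class FakeOcrEngine {
        FakeOcrEngine(AtomicInteger initCounter) {
            // A real engine would load its language data here, which is the costly step.
            initCounter.incrementAndGet();
        }

        String recognize(String pageName) {
            return "text of " + pageName;
        }
    }

    // Processes `pages` pages on `threads` workers and returns how many engines were created.
    static int countInitializations(int pages, int threads) {
        AtomicInteger initCounter = new AtomicInteger();
        // One engine per worker thread, created lazily on that thread's first page.
        ThreadLocal<FakeOcrEngine> engine =
                ThreadLocal.withInitial(() -> new FakeOcrEngine(initCounter));

        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<String>> results = new ArrayList<>();
            for (int i = 0; i < pages; i++) {
                String pageName = "page-" + i;
                Callable<String> task = () -> engine.get().recognize(pageName);
                results.add(pool.submit(task));
            }
            for (Future<String> result : results) {
                result.get(); // wait for completion and surface any failure
            }
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("page processing failed", e);
        } finally {
            pool.shutdown();
        }
        return initCounter.get();
    }

    public static void main(String[] args) {
        int inits = countInitializations(100, 4);
        System.out.println("engines created: " + inits + " for 100 pages");
    }
}
```

The count of engines created is bounded by the number of worker threads rather than by the number of pages. Whether this matches the cause of the slowdown in the original report is not something the thread confirms.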

