Tom, on the subject of fonts: eng.inttemp is a binary file in 3.0.4, and I did not see a command to extract its contents. Do you have any suggestions on how to review this file?

Thanks,
viraf
On Monday, February 15, 2016 at 1:22:57 PM UTC-5, Tom Morris wrote:
> On Sunday, February 14, 2016 at 11:15:12 AM UTC-5, viraf wrote:
>>
>> *Speed*
>> On an Intel i7-4800MQ @ 2.7GHz I am getting approximately 6 PPM using 1
>> thread. I was looking for suggestions on how to speed up page processing.
>> I use parallelStream to process each page in a separate thread,
>
> You don't say what resolution or format images, what language(s), what
> version of Tesseract -- all of which are pretty critical when discussing
> performance. Having said that, I just ran a 110 page document in 272
> seconds on a recent MacBook Pro. There were ~100 pages of mixed density
> text totalling 160k characters in CCITT G4 fax bitonal images of 2550x3300
> pixels.
>
> That's four times the speed you quote, so I suspect you're reinitializing
> Tesseract for every page or taking a big hit on image processing or
> something else unrelated to the core OCR engine.
>
>>
>> *Training*
>> I am trying to learn about training Tesseract for improved accuracy.
>> Given that the fonts / box files used to generate eng.traineddata are not
>> available, can one specify the fonts used for English?
>
> The font list is included in the eng.inttemp file that you extracted.
> Given that it's something like 350 fonts, you'd have to be looking at a
> pretty exotic font to need to retrain for that reason.
>
>> Also, is there a description of the various training artifacts? I used
>> "combine_tessdata -u" to unpack eng.traineddata and "dawg2wordlist" to
>> extract the wordlist, however I was looking for documentation to better
>> understand the various training artifacts.
>
> Have you reviewed the training documentation on the wiki?
>
> https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract
>
> Tom
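For reference, here is a minimal sketch of what avoiding per-page reinitialization could look like with parallelStream, assuming the engine is not safe to share across threads. OcrEngine, createEngine, and recognize are hypothetical stand-ins, not part of Tesseract or any real Java binding; the stub only simulates the costly initialization so that the number of engines created can be counted.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class PerThreadOcr {

    /** Hypothetical stand-in for a Tesseract binding; not a real API. */
    interface OcrEngine {
        String recognize(String pagePath);
    }

    /** Counts how many engines were created, to make the reuse visible. */
    static final AtomicInteger ENGINES_CREATED = new AtomicInteger();

    /** Simulates the expensive step (loading traineddata, initializing). */
    static OcrEngine createEngine() {
        ENGINES_CREATED.incrementAndGet();
        return pagePath -> "text of " + pagePath;
    }

    /** One engine per worker thread, created lazily on first use. */
    static final ThreadLocal<OcrEngine> ENGINE =
            ThreadLocal.withInitial(PerThreadOcr::createEngine);

    /** Processes pages in parallel, reusing each thread's engine. */
    static List<String> recognizeAll(List<String> pagePaths) {
        return pagePaths.parallelStream()
                .map(path -> ENGINE.get().recognize(path))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> pages = IntStream.rangeClosed(1, 100)
                .mapToObj(i -> "page-" + i + ".tif")
                .collect(Collectors.toList());
        List<String> results = recognizeAll(pages);
        // The engine count depends on how many worker threads the stream used.
        System.out.println(results.size() + " pages, "
                + ENGINES_CREATED.get() + " engines created");
    }
}
```

With this shape the engine count is bounded by the number of worker threads and not by the number of pages. If a real binding turns out to be thread-safe, a single shared instance would be simpler still.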

