On Sunday, February 14, 2016 at 11:15:12 AM UTC-5, viraf wrote:
>
>
> *Speed*
> On an Intel i7-4800MQ @ 2.7GHz I am getting approximately 6 PPM using 1 
> thread.  I was looking for suggestions on how to speed up page processing. 
>  I use parallelStream to process each page in a separate thread,
>

You don't say what image resolution or format, what language(s), or what 
version of Tesseract you're using, all of which are pretty critical when 
discussing performance.  Having said that, I just ran a 110 page document 
in 272 seconds (roughly 24 PPM) on a recent MacBook Pro.  There were ~100 
pages of mixed density text totalling 160k characters in CCITT G4 fax 
bitonal images of 2550x3300 pixels.

That's four times the speed you quote, so I suspect you're reinitializing 
Tesseract for every page or taking a big hit on image processing or 
something else unrelated to the core OCR engine.
 

>
> *Training*
> I am trying to learn about training Tesseract for improved accuracy. 
>  Given that the fonts / box files used to generate eng.traineddata are not 
> available can one specify the fonts used for English?  
>

The font list is included in the eng.inttemp file that you extracted. Given 
that it's something like 350 fonts, you'd have to be looking at a pretty 
exotic font to need to retrain for that reason.
 

> Also, is there a description of the various training artifacts?  I used 
> "combine_tessdata 
> -u" to unpack eng.traineddata and  "dawg2wordlist" to extract the 
> wordlist, however was looking for documentation to better understand the 
> various training artifacts.
>

Have you reviewed the training documentation on the wiki?

https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract

Tom
 
