I am new to Tesseract and am using it through Tess4J.  I am trying to OCR 
faxes whose pages are TIFF (CCITT T.6) images, 2509 x 3530 at 300 dpi 
(1 bit, i.e. black and white).  

I have two sets of questions.

*Speed*
On an Intel i7-4800MQ @ 2.7GHz I am getting approximately 6 pages per 
minute (PPM) using 1 thread.  I am looking for suggestions on how to speed 
up page processing.  I use parallelStream to process each page in a 
separate thread.
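
A minimal sketch of the kind of per-page parallelStream setup described above. It assumes that an OCR engine instance should not be shared across threads, so each worker thread gets its own. The names ParallelPageOcr, ocrPages and engineFactory are hypothetical and are not part of Tess4J; the engine is modeled as a plain function from page index to text so the sketch runs without the library.

```java
import java.util.List;
import java.util.function.Function;
import java.util.function.Supplier;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

// Hypothetical helper (not a Tess4J class): OCRs pages in parallel,
// giving each worker thread its own engine instance.
final class ParallelPageOcr {

    // engineFactory builds one engine per worker thread. With Tess4J it could,
    // for example, create a net.sourceforge.tess4j.Tesseract and return a
    // function that calls doOCR on the image for the given page (assumption).
    static List<String> ocrPages(int pageCount,
                                 Supplier<Function<Integer, String>> engineFactory) {
        // One engine per thread, created lazily on first use by that thread.
        ThreadLocal<Function<Integer, String>> engine =
                ThreadLocal.withInitial(engineFactory);

        // The range is an ordered stream, so the collected list keeps page
        // order even though pages are processed concurrently.
        return IntStream.range(0, pageCount)
                .parallel()
                .mapToObj(page -> engine.get().apply(page))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Stand-in engine for demonstration only; it performs no real OCR.
        List<String> texts = ocrPages(4, () -> page -> "text of page " + page);
        texts.forEach(System.out::println);
    }
}
```

Parallel streams run on the common ForkJoinPool, whose size can be set with the java.util.concurrent.ForkJoinPool.common.parallelism system property. Whether the throughput scales with the number of cores would have to be measured.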

*Training*
I am trying to learn about training Tesseract for improved accuracy.  Given 
that the fonts / box files used to generate eng.traineddata are not 
available, can one specify the fonts used for English?  
Also, is there a description of the various training artifacts?  I used 
"combine_tessdata -u" to unpack eng.traineddata and "dawg2wordlist" to 
extract the wordlist, but I am looking for documentation to better 
understand the various training artifacts.

Thanks

- viraf
