Tom, the images are TIFF (CCITT T.6) images, 2509 x 3530 @ 300 dpi (1 bit, 
i.e. B&W). The language is English.  I am using Tess4J 3.0, which includes 
Tesseract 3.04.  I am instantiating a new Tesseract object for each page; 
however, the cost was minimal (74 ms) for the total run.  I'll investigate 
further whether the Java APIs are calling init elsewhere.  

When you say "taking a big hit on image processing", how would I isolate 
the issue to image processing?  
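
One possible way to separate the two costs is to time the image decode and 
the OCR call as distinct stages and compare the totals. The sketch below is 
only an illustration: StageTimer is a hypothetical helper (not part of 
Tess4J), and the usage shown in the comments assumes the page is decoded with 
ImageIO.read and then passed to doOCR as a BufferedImage.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.Callable;

/** Accumulates wall-clock time per named stage. Hypothetical helper, not part of Tess4J. */
final class StageTimer {
    private final Map<String, Long> nanos = new LinkedHashMap<>();
    private final Map<String, Integer> calls = new LinkedHashMap<>();

    /** Runs the work, records its elapsed time under the stage name, and returns its result. */
    <T> T time(String stage, Callable<T> work) {
        long start = System.nanoTime();
        try {
            return work.call();
        } catch (Exception e) {
            throw new RuntimeException("stage '" + stage + "' failed", e);
        } finally {
            record(stage, System.nanoTime() - start);
        }
    }

    // Synchronized so the timer can be shared by parallelStream workers.
    private synchronized void record(String stage, long elapsedNanos) {
        nanos.merge(stage, elapsedNanos, Long::sum);
        calls.merge(stage, 1, Integer::sum);
    }

    synchronized long totalMillis(String stage) {
        return nanos.getOrDefault(stage, 0L) / 1_000_000;
    }

    synchronized int calls(String stage) {
        return calls.getOrDefault(stage, 0);
    }

    /** One line per stage: name, call count, accumulated milliseconds. */
    synchronized String report() {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, Long> e : nanos.entrySet()) {
            sb.append(String.format("%s: %d calls, %d ms total%n",
                    e.getKey(), calls.get(e.getKey()), e.getValue() / 1_000_000));
        }
        return sb.toString();
    }

    // Assumed usage per page (file and tesseract are placeholder names):
    //   BufferedImage img = timer.time("decode", () -> ImageIO.read(file));
    //   String text       = timer.time("ocr",    () -> tesseract.doOCR(img));
    // After the run, timer.report() shows how the time splits between stages.
}
```

If the "decode" total is a large share of the run, the time is going into 
image loading rather than the OCR engine. With parallelStream the per-stage 
totals are summed across threads, so they can exceed the wall-clock time.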

Thanks for your help.  

- viraf


On Monday, February 15, 2016 at 1:22:57 PM UTC-5, Tom Morris wrote:
>
>
>
> On Sunday, February 14, 2016 at 11:15:12 AM UTC-5, viraf wrote:
>>
>>
>> *Speed*
>> On an Intel i7-4800MQ @ 2.7GHz I am getting approximately 6 PPM using 1 
>> thread.  I was looking for suggestions on how to speed up page processing. 
>>  I use parallelStream to process each page in a separate thread,
>>
>
> You don't say what resolution or format the images are, what language(s), 
> or what version of Tesseract -- all of which are pretty critical when discussing 
> performance.  Having said that, I just ran a 110 page document in 272 
> seconds on a recent MacBook Pro.  There were ~100 pages of mixed density 
> text totalling 160k characters in CCITT G4 fax bitonal images of 2550x3300 
> pixels.
>
> That's four times the speed you quote, so I suspect you're reinitializing 
> Tesseract for every page or taking a big hit on image processing or 
> something else unrelated to the core OCR engine.
>  
>
>>
>> *Training*
>> I am trying to learn about training Tesseract for improved accuracy. 
>>  Given that the fonts / box files used to generate eng.traineddata are not 
>> available, can one specify the fonts used for English?  
>>
>
> The font list is included in the eng.inttemp file that you extracted. 
> Given that it's something like 350 fonts, you'd have to be looking at a 
> pretty exotic font to need to retrain for that reason.
>  
>
>> Also, is there a description of the various training artifacts?  I used 
>> "combine_tessdata -u" to unpack eng.traineddata and "dawg2wordlist" to 
>> extract the wordlist, but I was looking for documentation to better 
>> understand the various training artifacts.
>>
>
> Have you reviewed the training documentation on the wiki?
>
> https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract
>
> Tom
>  
>
