Tom, regarding fonts: eng.inttemp is a binary file in 3.0.4, and I did 
not see a command to extract its contents.  Do you have suggestions on how 
to review this file?  Thanks - viraf

On Monday, February 15, 2016 at 1:22:57 PM UTC-5, Tom Morris wrote:
>
>
>
> On Sunday, February 14, 2016 at 11:15:12 AM UTC-5, viraf wrote:
>>
>>
>> *Speed*
>> On an Intel i7-4800MQ @ 2.7GHz I am getting approximately 6 pages per 
>> minute (PPM) using 1 thread.  I was looking for suggestions on how to 
>> speed up page processing.  I use parallelStream to process each page in 
>> a separate thread,
>>
>
> You don't say what image resolution or format, what language(s), or what 
> version of Tesseract -- all of which are pretty critical when discussing 
> performance.  Having said that, I just ran a 110 page document in 272 
> seconds on a recent MacBook Pro.  There were ~100 pages of mixed density 
> text totalling 160k characters in CCITT G4 fax bitonal images of 2550x3300 
> pixels.
>
> That's four times the speed you quote, so I suspect you're reinitializing 
> Tesseract for every page or taking a big hit on image processing or 
> something else unrelated to the core OCR engine.
>  
>
>>
>> *Training*
>> I am trying to learn about training Tesseract for improved accuracy. 
>> Given that the fonts / box files used to generate eng.traineddata are not 
>> available, can one specify the fonts used for English?  
>>
>
> The font list is included in the eng.inttemp file that you extracted. 
> Given that it's something like 350 fonts, you'd have to be looking at a 
> pretty exotic font to need to retrain for that reason.
>  
>
>> Also, is there a description of the various training artifacts?  I used 
>> "combine_tessdata -u" to unpack eng.traineddata and "dawg2wordlist" to 
>> extract the wordlist, but I was looking for documentation to better 
>> understand the various training artifacts.
>>
>
> Have you reviewed the training documentation on the wiki?
>
> https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract
>
> Tom
>  
>
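
On the speed point quoted above (110 pages in 272 seconds is roughly 24 PPM, 
against the 6 PPM reported): one way to avoid reinitializing the engine for 
every page while still using parallelStream is to keep one engine per worker 
thread.  Below is a minimal Java sketch of that idea.  FakeOcrEngine is a 
hypothetical stand-in for whatever Tesseract wrapper is in use; it is not a 
real API, and its constructor only counts initializations in place of the 
expensive setup.  The page names are made up as well.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class EngineReuseSketch {

    // Hypothetical stand-in for a Tesseract wrapper; NOT a real library class.
    static final class FakeOcrEngine {
        // Counts how many engines were constructed across all threads.
        static final AtomicInteger INIT_COUNT = new AtomicInteger();

        FakeOcrEngine() {
            // In a real wrapper this is where the costly language-data load would happen.
            INIT_COUNT.incrementAndGet();
        }

        String recognize(String pageName) {
            // Placeholder for the actual OCR call.
            return "text of " + pageName;
        }
    }

    // One engine per worker thread, created lazily on first use and reused afterwards.
    private static final ThreadLocal<FakeOcrEngine> ENGINE =
            ThreadLocal.withInitial(FakeOcrEngine::new);

    // Processes pages in parallel; each thread reuses its own engine instance.
    static List<String> ocrAll(List<String> pages) {
        return pages.parallelStream()
                .map(page -> ENGINE.get().recognize(page))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> pages = IntStream.rangeClosed(1, 110)
                .mapToObj(i -> "page-" + i + ".tif")
                .collect(Collectors.toList());

        List<String> results = ocrAll(pages);

        System.out.println(results.size() + " pages processed, "
                + FakeOcrEngine.INIT_COUNT.get() + " engine initializations");
    }
}
```

With this layout the initialization count should track the number of worker 
threads rather than the number of pages.  The sketch keeps one engine per 
thread instead of sharing a single instance because, as far as I know, a 
Tesseract engine instance is not meant to be used from several threads at 
once; treat that as an assumption to check against the wrapper you use.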
