Viraf, I'm bringing this thread back from the dead, but did you ever figure 
out how to squeeze out more performance from Tesseract?

On Sunday, February 21, 2016 at 8:15:52 AM UTC-8, viraf wrote:
>
> 1800 pages is on the larger side.  Files can range from a few pages to more 
> than 1800 pages.  Initial tests were done with a document of 22 pages.  I ran a 
> test you outlined below on a 372-page file on a Linux guest VM using 
> Tesseract 3.04 and results were disappointing (approx 3 PPM).  I then ran 
> my initial test application with Tess4J on the 372 pages and results were 
> approximately 9 PPM.  The init does not appear to be as expensive as 
> thought - 
>
>
> Pages   Time (ms)   PPM
> 372     2395903     9.315903
> 372     2293524     9.731749
> The first run instantiated a new engine for each page, calling 
> init/setTessVariables and disposing of it at the end.  The second run moved 
> allocation, init/setTessVariables and disposal out of the loop.  I am 
> calling ProcessPage specifying a text renderer (the earlier test generated 
> hOCR and PDF files).
>
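For reference, the PPM column above appears to be pages divided by elapsed minutes, i.e. pages * 60000 / ms. A minimal shell sketch of that arithmetic follows; the `ppm` helper name is made up for illustration and is not part of Tesseract or Tess4J.

```shell
# ppm PAGES MILLISECONDS
# Pages per minute = pages / (milliseconds / 60000).
# "ppm" is a hypothetical helper name used only for this illustration.
ppm() {
    awk -v pages="$1" -v ms="$2" 'BEGIN { printf "%.6f\n", pages * 60000 / ms }'
}

ppm 372 2395903   # run 1: new engine and init for every page
ppm 372 2293524   # run 2: engine allocated and initialised once
```

By the same numbers, moving init out of the loop saved 102379 ms over 372 pages, or roughly 275 ms per page.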
> So, I will deploy this code on the Linux guest VM and see if I get similar 
> results.  The speed difference could be related to Tesseract build options 
> between Windows and Linux.  
>
> - viraf
>
> On Saturday, February 20, 2016 at 11:55:43 AM UTC-5, Tom Morris wrote:
>>
>> On Friday, February 19, 2016 at 3:00:42 PM UTC-5, viraf wrote:
>>>
>>> Tom, I created a multi-page TIFF as per earlier recommendation on this 
>>> thread (avoid multiple inits).  Running it on Linux from the command line 
>>> provided me with a reference by which to compute PPM that I could target 
>>> with Tess4J.  I had hoped to get 10+ PPM / core and shift focus to 
>>> accuracy.  I am at about 6 PPM and unclear where / how to improve 
>>> performance (speed).  
>>>
>>
>> I take it the question about the representativeness of that size file was 
>> too sensitive/boring/trivial/... to answer. 
>>
>> Given the issues with multi-page TIFFs, one experiment worth running is 
>> to try a list of single page TIFFs instead of one ridiculously large file.
>>
>> $ cat > filelist.txt
>> page0001.tif
>> page0002.tif
>> ...
>> page1800.tif
>>
>> $ tesseract filelist.txt
>>
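Typing 1800 names by hand is tedious, so the list above could be generated instead. A sketch, assuming the pages really are named page0001.tif through page1800.tif as in the example; the output base `out` in the final comment is only a placeholder, since the command-line tool normally expects an output base after the input name.

```shell
# Write page0001.tif ... page1800.tif to filelist.txt, one name per line.
# Adjust the pattern and the count to match the real file names.
for i in $(seq 1 1800); do
    printf 'page%04d.tif\n' "$i"
done > filelist.txt

# Then pass the list to Tesseract ("out" is a placeholder output base):
# tesseract filelist.txt out
```

Because the list is an ordinary text file, it can be checked with head, tail, or wc -l before starting a long OCR run.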
>> Tom
>>
>
