1800 pages is on the larger side.  Files can range from a few pages to more 
than 1800 pages.  Initial tests were done with a 22-page document.  I ran 
the test you outlined below on a 372-page file on a Linux guest VM using 
Tesseract 3.04, and the results were disappointing (approx. 3 PPM).  I then 
ran my initial test application with Tess4J on the same 372 pages and got 
approximately 9 PPM.  The init does not appear to be as expensive as I 
thought:


Pages   Time (ms)   PPM
372     2395903     9.315903
372     2293524     9.731749

The first run instantiated a new engine for each page, called 
init/setTessVariables, and disposed of it at the end.  The second run moved 
allocation, init/setTessVariables, and disposal out of the loop.  I am 
calling ProcessPage with a text renderer (the earlier test generated hOCR 
and PDF files).
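
For anyone checking the arithmetic, here is a small sketch of how the PPM 
figures in the table above follow from the page count and elapsed time, and 
what the gap between the two runs works out to per page.  The run labels and 
the function name are mine, chosen only for illustration; the numbers are 
the ones from the table.

```python
# Timings copied from the table above: (pages, elapsed milliseconds).
# The labels are illustrative names for the two runs described in the text.
runs = {
    "init per page": (372, 2_395_903),
    "init once": (372, 2_293_524),
}


def pages_per_minute(pages, elapsed_ms):
    """Throughput in pages per minute for a run timed in milliseconds."""
    return pages / (elapsed_ms / 60_000)


for label, (pages, elapsed_ms) in runs.items():
    print(f"{label}: {pages_per_minute(pages, elapsed_ms):.2f} PPM")

# Extra time per page when the engine is re-created for every page.
pages, slow_ms = runs["init per page"]
_, fast_ms = runs["init once"]
overhead_ms_per_page = (slow_ms - fast_ms) / pages
print(f"per-page init overhead: {overhead_ms_per_page:.0f} ms")
```

That comes to roughly 275 ms of init overhead per page against a bit over 
6 seconds of total time per page, which is why hoisting the init out of the 
loop only moved the result from about 9.3 to 9.7 PPM.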

So, I will deploy this code on the Linux guest VM and see if I get similar 
results.  The speed difference could be related to differences in Tesseract 
build options between Windows and Linux.  

- viraf

On Saturday, February 20, 2016 at 11:55:43 AM UTC-5, Tom Morris wrote:
>
> On Friday, February 19, 2016 at 3:00:42 PM UTC-5, viraf wrote:
>>
>> Tom, I created a multi-page TIFF as per earlier recommendation on this 
>> thread (avoid multiple inits).  Running it on Linux from the command line 
>> provided me with a reference by which to compute PPM that I could target 
>> with Tess4J.  I had hoped to get 10+ PPM / core and shift focus on 
>> accuracy.  I am at about 6 PPM and unclear where / how to improve 
>> performance (speed).  
>>
>
> I take it the question about the representativeness of that size file was 
> too sensitive/boring/trivial/... to answer. 
>
> Given the issues with multi-page TIFFs, one experiment worth running is to 
> try a list of single page TIFFs instead of one ridiculously large file.
>
> $ cat > filelist.txt
> page0001.tif
> page0002.tif
> ...
> page1800.tif
>
> $ tesseract filelist.txt
>
> Tom
>
