1800 pages is on the larger side. Files can range from a few pages to more than 1800 pages. Initial tests were done with a 22-page document. I ran the test you outlined below on a 372-page file on a Linux guest VM using Tesseract 3.04, and the results were disappointing (approx. 3 PPM). I then ran my initial test application with Tess4J on the same 372 pages and got approximately 9 PPM. The init does not appear to be as expensive as thought:

Pages   Time (ms)   PPM
372     2395903     9.315903
372     2293524     9.731749

The first run was with instantiating a new engine for each page and calling init/setTessVariables and disposing at the end. The second run was with allocation, init/setTessVariables and disposing moved out of the loop. I am calling ProcessPage specifying a text renderer (the earlier test generated hOCR and PDF files).

So, I will deploy this code on the Linux guest VM and see if I get similar results. The speed difference could be related to Tesseract build options between Windows and Linux.

- viraf

On Saturday, February 20, 2016 at 11:55:43 AM UTC-5, Tom Morris wrote:
>
> On Friday, February 19, 2016 at 3:00:42 PM UTC-5, viraf wrote:
>>
>> Tom, I created a multi-page TIFF as per earlier recommendation on this
>> thread (avoid multiple inits). Running it on Linux from the command line
>> provided me with a reference by which to compute PPM that I could target
>> with Tess4J. I had hoped to get 10+ PPM / core and shift focus on
>> accuracy. I am at about 6 PPM and unclear where / how to improve
>> performance (speed).
>>
>
> I take it the question about the representativeness of that size file was
> too sensitive/boring/trivial/... to answer.
>
> Given the issues with multi-page TIFFs, one experiment worth running is to
> try a list of single page TIFFs instead of one ridiculously large file.
>
> $ cat > filelist.txt
> page0001.tif
> page0002.tif
> ...
> page1800.tif
>
> $ tesseract filelist.txt
>
> Tom
>
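
For anyone who wants to check the PPM figures above, here is a minimal sketch of the arithmetic, assuming PPM simply means pages divided by elapsed wall-clock minutes. The class and method names (PpmCalc, pagesPerMinute) are made up for illustration and are not part of Tess4J or Tesseract.

```java
// Hypothetical helper, not part of Tess4J: converts a page count and an
// elapsed time in milliseconds into pages per minute (PPM).
final class PpmCalc {

    /** Pages per minute = pages / (elapsedMs / 60000). */
    static double pagesPerMinute(int pages, long elapsedMs) {
        if (elapsedMs <= 0) {
            throw new IllegalArgumentException("elapsedMs must be positive");
        }
        return pages * 60_000.0 / elapsedMs;
    }

    public static void main(String[] args) {
        // The two runs reported in this message (372 pages each).
        System.out.printf("engine per page:  %.2f PPM%n", pagesPerMinute(372, 2_395_903L));
        System.out.printf("engine reused:    %.2f PPM%n", pagesPerMinute(372, 2_293_524L));
    }
}
```

With the two timings above this gives roughly 9.32 and 9.73 PPM, so moving init out of the loop saved about 102 seconds over 372 pages, which is why the init does not look like the bottleneck.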

