Viraf, I'm bringing this thread back from the dead, but did you ever figure out how to squeeze more performance out of Tesseract?

On Sunday, February 21, 2016 at 8:15:52 AM UTC-8, viraf wrote:
> 1800 pages is on the larger side. Files can range from a few pages to
> 1800 pages. Initial tests were done with a document of 22 pages. I ran a
> test you outlined below on a 372 page file on a linux guest VM using
> Tesseract 3.04 and results were disappointing (approx 3 PPM). I then ran
> my initial test application with Tess4J on the 372 pages and results were
> approximately 9 PPM. The init does not appear to be as expensive as
> thought -
>
> Pages   Time (ms)   PPM
> 372     2395903     9.315903
> 372     2293524     9.731749
>
> The first run was with instantiating a new engine for each page and
> calling init/setTessVariables and disposing at the end. The second run was
> with allocation, init/setTessVariables and disposing moved out of the loop.
> I am calling ProcessPage specifying a text renderer (earlier test
> generated hocr and pdf file).
>
> So, I will deploy this code on the Linux guest VM and see if I get similar
> results. The speed difference could be related to tesseract build options
> between windows and Linux.
>
> - viraf
>
> On Saturday, February 20, 2016 at 11:55:43 AM UTC-5, Tom Morris wrote:
>> On Friday, February 19, 2016 at 3:00:42 PM UTC-5, viraf wrote:
>>> Tom, I created a multi-page TIFF as per earlier recommendation on this
>>> thread (avoid multiple inits). Running it on Linux from the command line
>>> provided me with a reference by which to compute PPM that I could target
>>> with Tess4J. I had hoped to get 10+ PPM / core and shift focus on
>>> accuracy. I am at about 6 PPM and unclear where / how to improve
>>> performance (speed).
>>
>> I take it the question about the representativeness of that size file was
>> too sensitive/boring/trivial/... to answer.
>>
>> Given the issues with multi-page TIFFs, one experiment worth running is
>> to try a list of single page TIFFs instead of one ridiculously large file.
>>
>> $ cat > filelist.txt
>> page0001.tif
>> page0002.tif
>> ...
>> page1800.tif
>>
>> $ tesseract filelist.txt
>>
>> Tom
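
For anyone repeating the experiment quoted above, here is a minimal shell sketch of the two mechanical steps: writing the list of single-page TIFF names and turning a page count plus elapsed milliseconds into PPM. The helper names `make_filelist` and `ppm` are hypothetical, and the `page%04d.tif` pattern is only assumed from the file names in the quoted example; adjust both to match the real files.

```shell
#!/bin/sh
# Hypothetical helpers for illustration; neither is part of Tesseract.

# Write COUNT zero-padded single-page TIFF names to OUT, one per line.
# The page%04d.tif pattern is an assumption taken from the quoted example.
make_filelist() {
  count=$1
  out=$2
  : > "$out"                      # create or truncate the list file
  i=1
  while [ "$i" -le "$count" ]; do
    printf 'page%04d.tif\n' "$i" >> "$out"
    i=$((i + 1))
  done
}

# Pages per minute from a page count and an elapsed time in milliseconds.
ppm() {
  awk -v pages="$1" -v ms="$2" 'BEGIN { printf "%.6f\n", pages * 60000 / ms }'
}
```

Calling `make_filelist 1800 filelist.txt` would produce the list file that the quoted `tesseract filelist.txt` command reads, and `ppm 372 2395903` recomputes the first row of the timing table from its page count and elapsed time.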

