I ran a test with a multipage TIFF and am getting the same result of approximately 6 PPM. I used the following command to create the multipage TIFF

gs -o multipage-tiffg4.tif -sDEVICE=tiffg4 -r300x300 /media/sf_shared/00473706.PDF

and ran it under Windows and Linux. Here is the Linux output:

Tue Feb 16 08:55:14 EST 2016
Tesseract Open Source OCR Engine v3.04.00 with Leptonica
Page 1
Page 2
Page 3
Page 4
Page 5
Page 6
Page 7
Page 8
Page 9
Page 10
Page 11
Page 12
Page 13
OSD: Weak margin (4.51) for 95 blob text block, but using orientation anyway: 0
Page 14
Page 15
Page 16
Page 17
Page 18
Page 19
OSD: Weak margin (6.28) for 1715 blob text block, but using orientation anyway: 0
Page 20
OSD: Weak margin (2.15) for 1383 blob text block, but using orientation anyway: 0
Page 21
Page 22
Tue Feb 16 08:59:24 EST 2016

You had mentioned spending time on image processing, so I was wondering what the "OSD: Weak margin" messages mean.

The script used to OCR is

date
tesseract /media/sf_shared/multipage-tiffg4.tif out -l eng hocr
date

Any suggestions on where to investigate next would be appreciated.

Thanks
- viraf

On Tuesday, February 16, 2016 at 8:17:53 AM UTC-5, viraf wrote:
>
> Thanks for the clarification. I now know that 24 PPM on a single thread
> should be achievable. I'll update the post after trying a few options.
> Thanks for your help.
>
> - viraf
>
> On Tuesday, February 16, 2016 at 1:53:40 AM UTC-5, Tom Morris wrote:
>>
>> On Mon, Feb 15, 2016 at 8:24 PM, viraf <[email protected]> wrote:
>>
>>> Tom, the images are TIFF (CCITT T.6) images - 2509 x 3530 @ 300 dpi (1
>>> bit - i.e. BW). The language is English.
>>>
>>
>> So, roughly the same resolution and format as I used, but only 1/4 the
>> speed. My test machine calls itself a mid-2014 MBP with 2.5 GHz Intel Core
>> i7 (and no, it's not using OpenCL, the GPU, or multiple threads).
>>
>>
>>> I am using Tess4j 3.0, which includes Tesseract 3.0.4. I am
>>> instantiating a new Tesseract object for each page, however the cost was
>>> minimal (74ms) for the total run.
>>>
>>
>> I'm not familiar with the Tess4J wrapper, but that sounds pretty low for
>> initialization cost.
>> Are you sure you're measuring the true cost (ie you're
>> not being fooled by lazy initialization)? What happens when you combine all
>> the pages into a single multi-page TIFF and OCR it (so you can be sure
>> you've amortized the initialization cost)?
>>
>>> When you state "taking a big hit on image processing" how would I be able
>>> to isolate the issue to image processing?
>>>
>>
>> I was mainly talking about operations like thresholding, format
>> conversion, etc to get to a usable image. That's obviously not applicable
>> if you're working with bitonal images (which you hadn't disclosed when I
>> wrote my reply).
>>
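
As a cross-check on the throughput figure, the two date stamps in the Linux output above are 250 seconds apart for 22 pages. Below is a minimal sketch of that arithmetic. It assumes GNU date (for its -d option) and awk are available, and the function name ppm is made up for illustration; it is not part of Tesseract or of the script above.

```shell
#!/bin/sh
# Sketch only: pages per minute from two `date` stamps and a page count.
# Assumes GNU date, which can parse a stamp such as
# "Tue Feb 16 08:55:14 EST 2016" via -d. "ppm" is a hypothetical helper name.
ppm() {
    start=$(date -d "$1" +%s)   # start stamp as seconds since the epoch
    end=$(date -d "$2" +%s)     # end stamp as seconds since the epoch
    # pages * 60 / elapsed seconds, printed with two decimals
    awk -v p="$3" -v s="$start" -v e="$end" \
        'BEGIN { printf "%.2f\n", p * 60 / (e - s) }'
}

# Values copied from the Linux run above: 22 pages between the two stamps.
ppm "Tue Feb 16 08:55:14 EST 2016" "Tue Feb 16 08:59:24 EST 2016" 22
```

With those values the result is 5.28 (22 pages in 250 seconds), which is in line with the roughly 6 PPM quoted at the top of this message.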

