Thanks - I appreciate your help. I ran the perf tool and noticed that 40% of the time is spent in IntegerMatcher::UpdateTablesForFeatures.
Can you try to see if you get the same results on a non-Mac? Someone suggested that the Mac may automatically use the co-processor.

Thanks
- viraf

On Tuesday, February 16, 2016 at 1:13:01 PM UTC-5, Tom Morris wrote:
>
> Actually, I think the resolution specified in my TIFFs is a red herring
> and wrong, because the image sizes are the same as your originals. I'm not
> aware of any standard images and test timings. There are two test images
> in the source repo, but they're too small to be useful for any type of
> performance work.
>
> For the record, here's what my TIFF images look like:
>
> TIFF Directory at offset 0xabd56a (11261290)
>   Image Width: 3400 Image Length: 4401
>   Resolution: 204, 196 pixels/inch
>   Bits/Sample: 1
>   Compression Scheme: CCITT Group 3
>   Photometric Interpretation: min-is-white
>   FillOrder: lsb-to-msb
>   Orientation: row 0 top, col 0 lhs
>   Samples/Pixel: 1
>   Rows/Strip: (infinite)
>   Planar Configuration: single image plane
>   Page Number: 1-0
>   Software: fax2tiff
>   Group 3 Options: (0 = 0x0)
>   Fax Data: clean (0 = 0x0)
>   Bad Fax Lines: 0
>   Consecutive Bad Fax Lines: 0
>
> I don't think there's any significant difference in the images. Just
> for grins I reinstalled the 3.04.00 MacPorts version of tesseract and it
> took 3min21sec for the same file that takes 4min05sec with the current
> development build, so it doesn't look like there have been any recent
> performance improvements and perhaps even the opposite (hmmm).
>
> I think I've exhausted my easy suggestions for remote control (free)
> performance analysis, but I'm interested in hearing what, if anything, you
> find out.
>
> Tom
>
> On Tue, Feb 16, 2016 at 11:09 AM, viraf <[email protected]> wrote:
>
>> My timings were just for Tesseract to process the image. I tried using
>> standard Fax settings which improved processing time to about 8 PPM. I was
>> using 300 dpi as per recommendations on many forum postings.
>> Enclosed is the tiffinfo for the
>>
>> TIFF Directory at offset 0x8 (8)
>>   Subfile Type: multi-page document (2 = 0x2)
>>   Image Width: 1728 Image Length: 2292
>>   Resolution: 204, 196 pixels/inch
>>   Bits/Sample: 1
>>   Compression Scheme: CCITT Group 4
>>   Photometric Interpretation: min-is-white
>>   FillOrder: msb-to-lsb
>>   Orientation: row 0 top, col 0 lhs
>>   Samples/Pixel: 1
>>   Rows/Strip: 4969
>>   Planar Configuration: single image plane
>>   Page Number: 0-0
>>   Software: GPL Ghostscript 9.16
>>   DateTime: 2016:02:16 10:43:39
>>   Group 4 Options: (0 = 0x0)
>>
>> I'll look at building a new release - but that has its own challenges as
>> it is not a release. Do you have any other suggestions for me to
>> consider? Do you know if there are sample images that were used for
>> testing, where we have some metrics on speed? This would help me isolate
>> the problem to the images or to my build.
>>
>> - viraf
>>
>> On Tuesday, February 16, 2016 at 10:31:13 AM UTC-5, Tom Morris wrote:
>>>
>>> My pipeline for this kind of stuff uses:
>>>
>>> pdfimages - to extract the images
>>> faxtotiff - to convert CCITT to TIFF (using the parameters file
>>>             generated by pdfimages)
>>> tiffcp - to concatenate multiple TIFFs together into one big one
>>>
>>> but the important thing is the resulting TIFF. You could try running
>>> tiffinfo on it to see if anything looks funny. One thing I wonder about is
>>> the 300x300 resolution. My images are the standard (for fax), 204x196
>>> pixels/inch, so you've got double the pixels to start. That's likely one
>>> factor of 2 right there. Having Ghostscript do a full rendering at that
>>> resolution with the necessary image transforms can't be very fast. My
>>> pipeline takes 5 seconds for a 110 page document. Also, depending on what
>>> your starting resolution is, any image scaling is likely degrading the
>>> image quality.
>>>
>>> It seems unlikely that there have been huge performance changes in the
>>> last six months, but you could try building from source to see if it makes
>>> a difference. I'm using the latest 3.05 head sources from GitHub.
>>>
>>> Tom
>>>
>>> p.s. One caveat - I think faxtotiff, as distributed, is broken and I
>>> haven't had a chance to contribute my fixes back upstream yet.
>>>
>>> On Tue, Feb 16, 2016 at 9:11 AM, viraf <[email protected]> wrote:
>>>
>>>> I ran a test with a multipage TIFF, and am getting the same results of
>>>> approximately 6 PPM.
>>>> I used the following command to create the multipage TIFF
>>>>
>>>> gs -o multipage-tiffg4.tif -sDEVICE=tiffg4 -r300x300 /media/sf_shared/00473706.PDF
>>>>
>>>> and ran it under Windows and Linux. Here is the Linux output:
>>>>
>>>> Tue Feb 16 08:55:14 EST 2016
>>>> Tesseract Open Source OCR Engine v3.04.00 with Leptonica
>>>> Page 1
>>>> Page 2
>>>> Page 3
>>>> Page 4
>>>> Page 5
>>>> Page 6
>>>> Page 7
>>>> Page 8
>>>> Page 9
>>>> Page 10
>>>> Page 11
>>>> Page 12
>>>> Page 13
>>>> OSD: Weak margin (4.51) for 95 blob text block, but using orientation anyway: 0
>>>> Page 14
>>>> Page 15
>>>> Page 16
>>>> Page 17
>>>> Page 18
>>>> Page 19
>>>> OSD: Weak margin (6.28) for 1715 blob text block, but using orientation anyway: 0
>>>> Page 20
>>>> OSD: Weak margin (2.15) for 1383 blob text block, but using orientation anyway: 0
>>>> Page 21
>>>> Page 22
>>>> Tue Feb 16 08:59:24 EST 2016
>>>>
>>>> You had mentioned spending time on image processing, so was wondering
>>>> what the "OSD Weak Margin" messages mean. The script used to OCR is
>>>>
>>>> date
>>>> tesseract /media/sf_shared/multipage-tiffg4.tif out -l eng hocr
>>>> date
>>>>
>>>> Any suggestions on where to investigate next would be appreciated.
>>>>
>>>> Thanks
>>>>
>>>> - viraf
>>>>
>>>> On Tuesday, February 16, 2016 at 8:17:53 AM UTC-5, viraf wrote:
>>>>>
>>>>> Thanks for the clarification.
>>>>> I now know that 24 PPM on a single thread should be achievable. I'll
>>>>> update the post after trying a few options.
>>>>> Thanks for your help.
>>>>>
>>>>> - viraf
>>>>>
>>>>> On Tuesday, February 16, 2016 at 1:53:40 AM UTC-5, Tom Morris wrote:
>>>>>>
>>>>>> On Mon, Feb 15, 2016 at 8:24 PM, viraf <[email protected]> wrote:
>>>>>>
>>>>>>> Tom, the images are TIFF (CCITT T.6) images - 2509 x 3530 @ 300 dpi
>>>>>>> (1 bit - i.e. BW). The language is English.
>>>>>>
>>>>>> So, roughly the same resolution and format as I used, but only 1/4
>>>>>> the speed. My test machine calls itself a mid-2014 MBP with 2.5 GHz
>>>>>> Intel Core i7 (and no, it's not using OpenCL, the GPU, or multiple
>>>>>> threads).
>>>>>>
>>>>>>> I am using Tess4j 3.0, which includes Tesseract 3.0.4. I am
>>>>>>> instantiating a new Tesseract object for each page, however the cost
>>>>>>> was minimal (74ms) for the total run.
>>>>>>
>>>>>> I'm not familiar with the Tess4J wrapper, but that sounds pretty low
>>>>>> for initialization cost. Are you sure you're measuring the true cost
>>>>>> (i.e. you're not being fooled by lazy initialization)? What happens
>>>>>> when you combine all the pages into a single multi-page TIFF and OCR
>>>>>> it (so you can be sure you've amortized the initialization cost)?
>>>>>>
>>>>>>> When you state "taking a big hit on image processing" how would I be
>>>>>>> able to isolate the issue to image processing?
>>>>>>
>>>>>> I was mainly talking about operations like thresholding, format
>>>>>> conversion, etc to get to a usable image. That's obviously not
>>>>>> applicable if you're working with bitonal images (which you hadn't
>>>>>> disclosed when I wrote my reply).
>>>>>>
>>>>> --
>>>> You received this message because you are subscribed to a topic in the
>>>> Google Groups "tesseract-ocr" group.
>>>> To unsubscribe from this topic, visit
>>>> https://groups.google.com/d/topic/tesseract-ocr/5CSIYkba5Dc/unsubscribe.
>>>> To unsubscribe from this group and all its topics, send an email to
>>>> [email protected].
>>>> To post to this group, send email to [email protected].
>>>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>>>> To view this discussion on the web visit
>>>> https://groups.google.com/d/msgid/tesseract-ocr/a9b6dda7-740d-4d66-8b45-a632e9c8dc11%40googlegroups.com.
>>>> For more options, visit https://groups.google.com/d/optout.
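
For reference, the throughput of the 22-page Linux run quoted above can be worked out from the two `date` stamps in its output (08:55:14 and 08:59:24). The following is only a minimal sketch of that arithmetic, assuming both stamps fall on the same day; the helper names `to_seconds` and `pages_per_minute` are made up for illustration and are not part of Tesseract.

```shell
# Convert an HH:MM:SS stamp to seconds since midnight.
to_seconds() {
  echo "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

# Pages per minute for a run that starts and ends on the same day.
# Usage: pages_per_minute PAGES START END
pages_per_minute() {
  pages=$1
  start=$(to_seconds "$2")
  end=$(to_seconds "$3")
  awk -v p="$pages" -v s="$start" -v e="$end" \
    'BEGIN { printf "%.2f\n", p * 60 / (e - s) }'
}

# 22 pages, timed by the two date calls around the tesseract command.
pages_per_minute 22 08:55:14 08:59:24
```

The run took 250 seconds, so this gives about 5.3 PPM, in line with the "approximately 6 PPM" reported. The same calculation can be applied to the other timings in the thread when comparing builds or machines.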

