Re: [tesseract-ocr] Re: Tesseract performance (speed and accuracy)

viraf Tue, 16 Feb 2016 12:46:40 -0800

Thanks - I appreciate your help.  I ran the perf tool and noticed that 40% of 
the time is spent in IntegerMatcher::UpdateTablesForFeatures.
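
For anyone wanting to reproduce this kind of measurement, here is a minimal sketch of a Linux perf session. It assumes perf and tesseract are both on the PATH; the helper name profile_ocr and the file names in the usage comment are placeholders for illustration, not the exact commands used for the numbers above.

```shell
#!/bin/sh
# Sketch: profile one Tesseract run with Linux perf and list the hottest symbols.
# profile_ocr is a hypothetical helper name; perf and tesseract must be on PATH.
profile_ocr() {
    input=$1      # image to OCR, e.g. a multi-page TIFF
    outbase=$2    # output base name passed to tesseract
    # Record samples with call graphs for the whole OCR run.
    perf record -g -o perf.data -- tesseract "$input" "$outbase" -l eng hocr
    # Print a plain-text report grouped by symbol; hot spots appear first.
    perf report -i perf.data --sort symbol --stdio | head -n 40
}

# Usage (not run automatically):
#   profile_ocr multipage-tiffg4.tif out
```

A function that dominates the report, as IntegerMatcher::UpdateTablesForFeatures does here, suggests the time is going into character classification rather than image loading.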


Can you try to see if you get the same results on a non-Mac?  Someone 
suggested that the Mac may automatically use the co-processor.

Thanks

- viraf

On Tuesday, February 16, 2016 at 1:13:01 PM UTC-5, Tom Morris wrote:
>
> Actually, I think the resolution specified in my TIFFs is a red herring 
> and wrong, because the image sizes are the same as your originals. I'm not 
> aware of any standard images and test timings.  There are two test images 
> in the source repo, but they're too small to be useful for any type of 
> performance work.
>
> For the record, here's what my TIFF images look like:
>
> TIFF Directory at offset 0xabd56a (11261290)
>   Image Width: 3400 Image Length: 4401
>   Resolution: 204, 196 pixels/inch
>   Bits/Sample: 1
>   Compression Scheme: CCITT Group 3
>   Photometric Interpretation: min-is-white
>   FillOrder: lsb-to-msb
>   Orientation: row 0 top, col 0 lhs
>   Samples/Pixel: 1
>   Rows/Strip: (infinite)
>   Planar Configuration: single image plane
>   Page Number: 1-0
>   Software: fax2tiff
>   Group 3 Options: (0 = 0x0)
>   Fax Data: clean (0 = 0x0)
>   Bad Fax Lines: 0
>   Consecutive Bad Fax Lines: 0
>
> I don't think there's any significant difference in the images. Just 
> for grins I reinstalled the 3.04.00 MacPorts version of tesseract and it 
> took 3min21sec for the same file that takes 4min05sec with the current 
> development build, so it doesn't look like there have been any recent 
> performance improvements and perhaps even the opposite (hmmm).
>
> I think I've exhausted my easy suggestions for remote control (free) 
> performance analysis, but I'm interested in hearing what, if anything, you 
> find out.
>
> Tom
>
> On Tue, Feb 16, 2016 at 11:09 AM, viraf <[email protected]> wrote:
>
>> My timings were just for Tesseract to process the image.  I tried using 
>> standard Fax settings which improved processing time to about 8 PPM.  I was 
>> using 300 dpi as per recommendations on many forum postings.  Enclosed is 
>> the tiffinfo for the 
>>
>> TIFF Directory at offset 0x8 (8)
>>   Subfile Type: multi-page document (2 = 0x2)
>>   Image Width: 1728 Image Length: 2292
>>   Resolution: 204, 196 pixels/inch
>>   Bits/Sample: 1
>>   Compression Scheme: CCITT Group 4
>>   Photometric Interpretation: min-is-white
>>   FillOrder: msb-to-lsb
>>   Orientation: row 0 top, col 0 lhs
>>   Samples/Pixel: 1
>>   Rows/Strip: 4969
>>   Planar Configuration: single image plane
>>   Page Number: 0-0
>>   Software: GPL Ghostscript 9.16
>>   DateTime: 2016:02:16 10:43:39
>>   Group 4 Options: (0 = 0x0)
>>
>> I'll look at building a newer version from source, but that has its own 
>> challenges as it is not a release.  Do you have any other suggestions for 
>> me to consider?  Do you know if there are sample images that were used 
>> for testing, for which we have some metrics on speed?  This would help me 
>> isolate the problem to the images or to my build.
>>
>> - viraf
>>
>>
>>
>>
>>
>> On Tuesday, February 16, 2016 at 10:31:13 AM UTC-5, Tom Morris wrote:
>>>
>>> My pipeline for this kind of stuff uses:
>>>
>>>     pdfimages - to extract the images
>>>     fax2tiff - to convert CCITT to TIFF (using the parameters file 
>>> generated by pdfimages)
>>>     tiffcp - to concatenate multiple TIFFs together into one big one
>>>
>>> but the important thing is the resulting TIFF. You could try running 
>>> tiffinfo on it to see if anything looks funny.  One thing I wonder about is 
>>> the 300x300 resolution.  My images are the standard (for fax), 204x196 
>>> pixels/inch, so you've got double the pixels to start.  That's likely one 
>>> factor of 2 right there. Having Ghostscript do a full rendering at that 
>>> resolution with the necessary image transforms can't be very fast. My 
>>> pipeline takes 5 seconds for a 110 page document. Also, depending on what 
>>> your starting resolution is, any image scaling is likely degrading the 
>>> image quality.
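
As a rough check of the "double the pixels" estimate above, the pixel density of the two resolutions can be compared directly. This is a small sketch, not part of the original pipeline; pixel_ratio is a hypothetical helper name and it assumes the same physical page size at both resolutions.

```shell
# pixel_ratio XRES1 YRES1 XRES2 YRES2
# Prints how many times more pixels per square inch the first resolution has.
pixel_ratio() {
    awk -v a="$1" -v b="$2" -v c="$3" -v d="$4" \
        'BEGIN { printf "%.2f\n", (a * b) / (c * d) }'
}

# 300x300 dpi rendering versus standard fax resolution (204x196 pixels/inch)
pixel_ratio 300 300 204 196   # prints 2.25
```

So a 300x300 rendering carries a little more than twice the pixels of a standard-resolution fax page, consistent with the factor of 2 mentioned above.
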
>>>
>>> It seems unlikely that there have been huge performance changes in the 
>>> last six months, but you could try building from source to see if it makes 
>>> a difference. I'm using the latest 3.05 head sources from GitHub.
>>>
>>> Tom
>>>
>>> p.s. One caveat - I think fax2tiff, as distributed, is broken and I 
>>> haven't had a chance to contribute my fixes back upstream yet.
>>>
>>> On Tue, Feb 16, 2016 at 9:11 AM, viraf <[email protected]> wrote:
>>>
>>>> I ran a test with a multipage TIFF, and am getting the same results of 
>>>> approximately 6 PPM.
>>>> I used the following command to create the multipage TIFF:
>>>>   gs -o multipage-tiffg4.tif -sDEVICE=tiffg4 -r300x300 \
>>>>     /media/sf_shared/00473706.PDF
>>>>
>>>> and ran it under Windows and Linux.  Here is the Linux output:
>>>>
>>>> Tue Feb 16 08:55:14 EST 2016
>>>> Tesseract Open Source OCR Engine v3.04.00 with Leptonica
>>>> Page 1
>>>> Page 2
>>>> Page 3
>>>> Page 4
>>>> Page 5
>>>> Page 6
>>>> Page 7
>>>> Page 8
>>>> Page 9
>>>> Page 10
>>>> Page 11
>>>> Page 12
>>>> Page 13
>>>> OSD: Weak margin (4.51) for 95 blob text block, but using orientation 
>>>> anyway: 0
>>>> Page 14
>>>> Page 15
>>>> Page 16
>>>> Page 17
>>>> Page 18
>>>> Page 19
>>>> OSD: Weak margin (6.28) for 1715 blob text block, but using orientation 
>>>> anyway: 0
>>>> Page 20
>>>> OSD: Weak margin (2.15) for 1383 blob text block, but using orientation 
>>>> anyway: 0
>>>> Page 21
>>>> Page 22
>>>> Tue Feb 16 08:59:24 EST 2016
>>>>
>>>> You had mentioned spending time on image processing, so was wondering 
>>>> what the "OSD Weak Margin" messages mean.  The script used to OCR is
>>>>
>>>> date
>>>> tesseract /media/sf_shared/multipage-tiffg4.tif out -l eng hocr
>>>> date
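
The two date stamps in the log above bracket the whole run, so the throughput can be worked out from them. A minimal sketch, assuming the elapsed time is computed by hand from the timestamps; ppm is a hypothetical helper name.

```shell
# ppm PAGES SECONDS
# Prints pages per minute for a run of PAGES pages that took SECONDS seconds.
ppm() {
    awk -v pages="$1" -v secs="$2" \
        'BEGIN { printf "%.2f\n", pages * 60 / secs }'
}

# 22 pages between 08:55:14 and 08:59:24, i.e. 4 min 10 s = 250 s
ppm 22 250   # prints 5.28
```

That gives a little over 5 PPM for this run, in the same range as the roughly 6 PPM quoted above.
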
>>>>
>>>> Any suggestions on where to investigate next would be appreciated.
>>>>
>>>> Thanks
>>>>
>>>> - viraf
>>>>
>>>>
>>>> On Tuesday, February 16, 2016 at 8:17:53 AM UTC-5, viraf wrote:
>>>>>
>>>>> Thanks for the clarification.  I now know that 24 PPM on a single 
>>>>> thread should be achievable.  I'll update the post after trying a few 
>>>>> options.  
>>>>> Thanks for your help.
>>>>>
>>>>> - viraf
>>>>>
>>>>> On Tuesday, February 16, 2016 at 1:53:40 AM UTC-5, Tom Morris wrote:
>>>>>>
>>>>>> On Mon, Feb 15, 2016 at 8:24 PM, viraf <[email protected]> wrote:
>>>>>>
>>>>>>> Tom, the images are TIFF (CCITT T.6) images - 2509 x 3530 @ 300 dpi 
>>>>>>> (1 bit - i.e. BW). The language is English.
>>>>>>>
>>>>>>
>>>>>> So, roughly the same resolution and format as I used, but only 1/4 
>>>>>> the speed. My test machine calls itself a mid-2014 MBP with 2.5 GHz 
>>>>>> Intel Core i7 (and no, it's not using OpenCL, the GPU, or multiple 
>>>>>> threads).
>>>>>>  
>>>>>>
>>>>>>> I am using Tess4J 3.0, which includes Tesseract 3.04.  I am 
>>>>>>> instantiating a new Tesseract object for each page; however, the cost 
>>>>>>> was minimal (74ms) for the total run.
>>>>>>>
>>>>>>
>>>>>> I'm not familiar with the Tess4J wrapper, but that sounds pretty low 
>>>>>> for initialization cost. Are you sure you're measuring the true cost 
>>>>>> (i.e., you're not being fooled by lazy initialization)? What happens 
>>>>>> when you combine all the pages into a single multi-page TIFF and OCR it 
>>>>>> (so you can be sure you've amortized the initialization cost)?
>>>>>>
>>>>>>> When you state "taking a big hit on image processing" how would I be 
>>>>>>> able to isolate the issue to image processing?
>>>>>>>
>>>>>>
>>>>>> I was mainly talking about operations like thresholding, format 
>>>>>> conversion, etc. to get to a usable image.  That's obviously not 
>>>>>> applicable if you're working with bitonal images (which you hadn't 
>>>>>> disclosed when I wrote my reply).
>>>>>>
>
>

-- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To post to this group, send email to [email protected].
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/7fea4ffd-ae02-49de-b077-f1e4ae532bef%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
