Re: [tesseract-ocr] Re: Tesseract performance (speed and accuracy)

Tom Morris Tue, 16 Feb 2016 10:13:59 -0800

Actually, I think the resolution specified in my TIFFs is a red herring and
wrong, because the image sizes are the same as your originals. I'm not
aware of any standard images and test timings.  There are two test images
in the source repo, but they're too small to be useful for any type of
performance work.


For the record, here's what my TIFF images look like:

TIFF Directory at offset 0xabd56a (11261290)
  Image Width: 3400 Image Length: 4401
  Resolution: 204, 196 pixels/inch
  Bits/Sample: 1
  Compression Scheme: CCITT Group 3
  Photometric Interpretation: min-is-white
  FillOrder: lsb-to-msb
  Orientation: row 0 top, col 0 lhs
  Samples/Pixel: 1
  Rows/Strip: (infinite)
  Planar Configuration: single image plane
  Page Number: 1-0
  Software: fax2tiff
  Group 3 Options: (0 = 0x0)
  Fax Data: clean (0 = 0x0)
  Bad Fax Lines: 0
  Consecutive Bad Fax Lines: 0
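
As a rough sanity check on the "red herring" remark, the sketch below divides the pixel dimensions from the tiffinfo output above by the tagged resolution to get the implied physical page size. The helper name is hypothetical, and the US Letter comparison (8.5 x 11 in) is an assumption for illustration, not something stated in this thread.

```python
def implied_size_inches(width_px, height_px, xres_dpi, yres_dpi):
    """Return the (width, height) in inches implied by the resolution tags."""
    return width_px / xres_dpi, height_px / yres_dpi


# Pixel dimensions and resolution copied from the tiffinfo output above.
WIDTH_PX, HEIGHT_PX = 3400, 4401
w_in, h_in = implied_size_inches(WIDTH_PX, HEIGHT_PX, 204, 196)
print(f"implied page size: {w_in:.1f} x {h_in:.1f} in")

# Assumption (not from the thread): the page is really US Letter, 8.5 x 11 in.
LETTER_W_IN, LETTER_H_IN = 8.5, 11.0
eff_x = WIDTH_PX / LETTER_W_IN
eff_y = HEIGHT_PX / LETTER_H_IN
print(f"effective resolution if Letter: {eff_x:.0f} x {eff_y:.0f} dpi")
```

A page of roughly 16.7 x 22.5 inches would be unusually large for a fax, which is consistent with the resolution tag being wrong. If the page really is Letter-sized, the scan would be closer to 400 dpi.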

I don't think there's any significant difference in the images. Just
for grins I reinstalled the 3.04.00 MacPorts version of tesseract and it
took 3min21sec for the same file that takes 4min05sec with the current
development build, so it doesn't look like there have been any recent
performance improvements and perhaps even the opposite (hmmm).
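
To put the two timings above on a common scale, here is a minimal sketch that converts them to seconds and computes the relative slowdown. The helper name is hypothetical, and a single run of each build is a thin basis for any conclusion.

```python
def to_seconds(minutes, seconds):
    """Convert a minutes/seconds wall-clock timing to total seconds."""
    return minutes * 60 + seconds


release_s = to_seconds(3, 21)  # 3.04.00 MacPorts build
dev_s = to_seconds(4, 5)       # current development build

# Relative slowdown of the development build versus the release build.
slowdown = dev_s / release_s - 1
print(f"{release_s}s vs {dev_s}s: development build is {slowdown:.0%} slower")
```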

I think I've exhausted my easy suggestions for remote control (free)
performance analysis, but I'm interested in hearing what, if anything, you
find out.

Tom

On Tue, Feb 16, 2016 at 11:09 AM, viraf <[email protected]> wrote:

> My timings were just for Tesseract to process the image.  I tried using
> standard Fax settings which improved processing time to about 8 PPM.  I was
> using 300 dpi as per recommendations on many forum postings.  Enclosed is
> the tiffinfo for the
>
> TIFF Directory at offset 0x8 (8)
>   Subfile Type: multi-page document (2 = 0x2)
>   Image Width: 1728 Image Length: 2292
>   Resolution: 204, 196 pixels/inch
>   Bits/Sample: 1
>   Compression Scheme: CCITT Group 4
>   Photometric Interpretation: min-is-white
>   FillOrder: msb-to-lsb
>   Orientation: row 0 top, col 0 lhs
>   Samples/Pixel: 1
>   Rows/Strip: 4969
>   Planar Configuration: single image plane
>   Page Number: 0-0
>   Software: GPL Ghostscript 9.16
>   DateTime: 2016:02:16 10:43:39
>   Group 4 Options: (0 = 0x0)
>
> I'll look at building a new release - but that has its own challenges as
> it is not a release.  Do you have any other suggestions for me to
> consider?  Do you know if there are sample images that were used for
> testing, for which we have some metrics on speed?  This would help me isolate
> the problem to the images or to my build.
>
> - viraf
>
>
>
>
>
> On Tuesday, February 16, 2016 at 10:31:13 AM UTC-5, Tom Morris wrote:
>>
>> My pipeline for this kind of stuff uses:
>>
>>     pdfimages - to extract the images
>>     fax2tiff - to convert CCITT to TIFF (using the parameters file
>> generated by pdfimages)
>>     tiffcp - to concatenate multiple TIFFs together into one big one
>>
>> but the important thing is the resulting TIFF. You could try running
>> tiffinfo on it to see if anything looks funny.  One thing I wonder about is
>> the 300x300 resolution.  My images are the standard (for fax), 204x196
>> pixels/inch, so you've got double the pixels to start.  That's likely one
>> factor of 2 right there. Having Ghostscript do a full rendering at that
>> resolution with the necessary image transforms can't be very fast. My
>> pipeline takes 5 seconds for a 110 page document. Also, depending on what
>> your starting resolution is, any image scaling is likely degrading the
>> image quality.
>>
>> It seems unlikely that there have been huge performance changes in the
>> last six months, but you could try building from source to see if it makes
>> a difference. I'm using the latest 3.05 head sources from GitHub.
>>
>> Tom
>>
>> p.s. One caveat - I think fax2tiff, as distributed, is broken and I
>> haven't had a chance to contribute my fixes back upstream yet.
>>
>> On Tue, Feb 16, 2016 at 9:11 AM, viraf <[email protected]> wrote:
>>
>>> I ran a test with a multipage TIFF, and am getting the same results of
>>> approximately 6 PPM.
>>> I used the following command to create the multipage TIFF
>>>   gs -o multipage-tiffg4.tif -sDEVICE=tiffg4 -r300x300
>>> /media/sf_shared/00473706.PDF
>>>
>>> and ran it under Windows and Linux.  Here is the Linux output:
>>>
>>> Tue Feb 16 08:55:14 EST 2016
>>> Tesseract Open Source OCR Engine v3.04.00 with Leptonica
>>> Page 1
>>> Page 2
>>> Page 3
>>> Page 4
>>> Page 5
>>> Page 6
>>> Page 7
>>> Page 8
>>> Page 9
>>> Page 10
>>> Page 11
>>> Page 12
>>> Page 13
>>> OSD: Weak margin (4.51) for 95 blob text block, but using orientation
>>> anyway: 0
>>> Page 14
>>> Page 15
>>> Page 16
>>> Page 17
>>> Page 18
>>> Page 19
>>> OSD: Weak margin (6.28) for 1715 blob text block, but using orientation
>>> anyway: 0
>>> Page 20
>>> OSD: Weak margin (2.15) for 1383 blob text block, but using orientation
>>> anyway: 0
>>> Page 21
>>> Page 22
>>> Tue Feb 16 08:59:24 EST 2016
>>>
>>> You had mentioned spending time on image processing, so I was wondering
>>> what the "OSD Weak Margin" messages mean.  The script used to OCR is
>>>
>>> date
>>> tesseract /media/sf_shared/multipage-tiffg4.tif out -l eng hocr
>>> date
>>>
>>> Any suggestions on where to investigate next would be appreciated.
>>>
>>> Thanks
>>>
>>> - viraf
>>>
>>>
>>> On Tuesday, February 16, 2016 at 8:17:53 AM UTC-5, viraf wrote:
>>>>
>>>> Thanks for the clarification.  I now know that 24 PPM on a single
>>>> thread should be achievable.  I'll update the post after trying a few
>>>> options.
>>>> Thanks for your help.
>>>>
>>>> - viraf
>>>>
>>>> On Tuesday, February 16, 2016 at 1:53:40 AM UTC-5, Tom Morris wrote:
>>>>>
>>>>> On Mon, Feb 15, 2016 at 8:24 PM, viraf <[email protected]> wrote:
>>>>>
>>>>>> Tom, the images are TIFF (CCITT T.6) images - 2509 x 3530 @ 300 dpi
>>>>>> (1 bit - i.e. BW). The language is English.
>>>>>>
>>>>>
>>>>> So, roughly the same resolution and format as I used, but only 1/4 the
>>>>> speed. My test machine calls itself a mid-2014 MBP with 2.5 GHz Intel Core
>>>>> i7 (and no, it's not using OpenCL, the GPU, or multiple threads).
>>>>>
>>>>>
>>>>>> I am using Tess4J 3.0, which includes Tesseract 3.04.  I am
>>>>>> instantiating a new Tesseract object for each page; however, the cost was
>>>>>> minimal (74ms) for the total run.
>>>>>>
>>>>>
>>>>> I'm not familiar with the Tess4J wrapper, but that sounds pretty low
>>>>> for initialization cost. Are you sure you're measuring the true cost (i.e.
>>>>> you're not being fooled by lazy initialization)? What happens when you
>>>>> combine all the pages into a single multi-page TIFF and OCR it (so you can
>>>>> be sure you've amortized the initialization cost)?
>>>>>
>>>>>> When you state "taking a big hit on image processing" how would I be
>>>>>> able to isolate the issue to image processing?
>>>>>>
>>>>>
>>>>> I was mainly talking about operations like thresholding, format
>>>>> conversion, etc to get to a usable image.  That's obviously not applicable
>>>>> if you're working with bitonal images (which you hadn't disclosed when I
>>>>> wrote my reply).
>>>>>
>>>> --
>>> You received this message because you are subscribed to a topic in the
>>> Google Groups "tesseract-ocr" group.
>>> To unsubscribe from this topic, visit
>>> https://groups.google.com/d/topic/tesseract-ocr/5CSIYkba5Dc/unsubscribe.
>>> To unsubscribe from this group and all its topics, send an email to
>>> [email protected].
>>> To post to this group, send email to [email protected].
>>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>>> To view this discussion on the web visit
>>> https://groups.google.com/d/msgid/tesseract-ocr/a9b6dda7-740d-4d66-8b45-a632e9c8dc11%40googlegroups.com
>>> <https://groups.google.com/d/msgid/tesseract-ocr/a9b6dda7-740d-4d66-8b45-a632e9c8dc11%40googlegroups.com?utm_medium=email&utm_source=footer>
>>> .
>>>
>>> For more options, visit https://groups.google.com/d/optout.
>>>
>>

