Re: [tesseract-ocr] Re: Tesseract performance (speed and accuracy)

viraf Tue, 16 Feb 2016 06:12:07 -0800

I ran a test with a multipage TIFF and am getting the same result of 
approximately 6 PPM.  
I used the following command to create the multipage TIFF:
  gs -o multipage-tiffg4.tif -sDEVICE=tiffg4 -r300x300 \
     /media/sf_shared/00473706.PDF


and ran it under Windows and Linux.  Here is the Linux output:

Tue Feb 16 08:55:14 EST 2016
Tesseract Open Source OCR Engine v3.04.00 with Leptonica
Page 1
Page 2
Page 3
Page 4
Page 5
Page 6
Page 7
Page 8
Page 9
Page 10
Page 11
Page 12
Page 13
OSD: Weak margin (4.51) for 95 blob text block, but using orientation 
anyway: 0
Page 14
Page 15
Page 16
Page 17
Page 18
Page 19
OSD: Weak margin (6.28) for 1715 blob text block, but using orientation 
anyway: 0
Page 20
OSD: Weak margin (2.15) for 1383 blob text block, but using orientation 
anyway: 0
Page 21
Page 22
Tue Feb 16 08:59:24 EST 2016
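
As a sanity check on the PPM figure, the two timestamps above are 250 seconds 
apart for 22 pages. Below is a minimal sketch of that arithmetic; the function 
names are made up for illustration, and it assumes both lines come from the 
default `date` output format and fall on the same day.

```shell
# Convert the HH:MM:SS field (4th field) of a `date` output line
# to seconds since midnight.
to_seconds() {
  echo "$1" | awk '{ split($4, t, ":"); print t[1] * 3600 + t[2] * 60 + t[3] }'
}

# Usage: pages_per_minute PAGES START_LINE END_LINE
# Assumes both timestamps are on the same day (no midnight rollover).
pages_per_minute() {
  start=$(to_seconds "$2")
  end=$(to_seconds "$3")
  awk -v p="$1" -v s="$start" -v e="$end" \
    'BEGIN { printf "%.2f\n", p * 60 / (e - s) }'
}

pages_per_minute 22 "Tue Feb 16 08:55:14 EST 2016" "Tue Feb 16 08:59:24 EST 2016"
# prints 5.28
```

That works out to a little over 5 PPM, in line with the approximate figure 
quoted above.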

You had mentioned spending time on image processing, so I was wondering what 
the "OSD: Weak margin" messages mean.  The script used for OCR is:

date
tesseract /media/sf_shared/multipage-tiffg4.tif out -l eng hocr
date

Any suggestions on where to investigate next would be appreciated.

Thanks

- viraf


On Tuesday, February 16, 2016 at 8:17:53 AM UTC-5, viraf wrote:
>
> Thanks for the clarification.  I now know that 24 PPM on a single thread 
> should be achievable.  I'll update the post after trying a few options.  
> Thanks for your help.
>
> - viraf
>
> On Tuesday, February 16, 2016 at 1:53:40 AM UTC-5, Tom Morris wrote:
>>
>> On Mon, Feb 15, 2016 at 8:24 PM, viraf <[email protected]> wrote:
>>
>>> Tom, the images are TIFF (CCITT T.6) images - 2509 x 3530 @ 300 dpi (1 
>>> bit, i.e. BW). The language is English.  
>>>
>>
>> So, roughly the same resolution and format as I used, but only 1/4 the 
>> speed. My test machine calls itself a mid-2014 MBP with 2.5 GHz Intel Core 
>> i7 (and no, it's not using OpenCL, the GPU, or multiple threads).
>>  
>>
>>> I am using Tess4j 3.0, which includes Tesseract 3.0.4.  I am 
>>> instantiating a new Tesseract object for each page, however the cost was 
>>> minimal (74ms) for the total run.  
>>>
>>
>> I'm not familiar with the Tess4J wrapper, but that sounds pretty low for 
>> initialization cost. Are you sure you're measuring the true cost (i.e. you're 
>> not being fooled by lazy initialization)? What happens when you combine all 
>> the pages into a single multi-page TIFF and OCR it (so you can be sure 
>> you've amortized the initialization cost)?
>>
>>> When you state "taking a big hit on image processing" how would I be able 
>>> to isolate the issue to image processing?  
>>>
>>
>> I was mainly talking about operations like thresholding, format 
>> conversion, etc to get to a usable image.  That's obviously not applicable 
>> if you're working with bitonal images (which you hadn't disclosed when I 
>> wrote my reply).
>>
>

