My pipeline for this kind of stuff uses:

    pdfimages - to extract the images
    faxtotiff - to convert CCITT to TIFF (using the parameters file
                generated by pdfimages)
    tiffcp - to concatenate multiple TIFFs together into one big one

but the important thing is the resulting TIFF. You could try running
tiffinfo on it to see if anything looks funny.  One thing I wonder about is
the 300x300 resolution.  My images are the standard (for fax) 204x196
pixels/inch, so you've got more than double the pixels (about 2.25x) to
start.  That's likely one factor of 2 right there. Having Ghostscript do a
full rendering at that resolution with the necessary image transforms can't
be very fast. My pipeline takes 5 seconds for a 110 page document. Also,
depending on what your starting resolution is, any image scaling is likely
degrading the image quality.
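
To put rough numbers on the two points above, here is a small Python
sketch. It uses only figures quoted in this thread: the two resolutions,
and the start/end timestamps and page count from the 22-page Linux run
quoted below. The function names are made up for illustration, and pixel
count is only one plausible contributor to OCR time, not a proven linear
factor.

```python
from datetime import datetime

# Resolutions quoted in the thread (pixels per inch, horizontal x vertical).
FAX_DPI = (204, 196)
GS_DPI = (300, 300)


def pixel_ratio(a, b):
    """How many times more pixels per square inch resolution a has than b."""
    return (a[0] * a[1]) / (b[0] * b[1])


def pages_per_minute(start, end, pages, fmt="%H:%M:%S"):
    """Throughput from two wall-clock timestamps taken on the same day."""
    elapsed = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return pages * 60.0 / elapsed.total_seconds()


ratio = pixel_ratio(GS_DPI, FAX_DPI)
ppm = pages_per_minute("08:55:14", "08:59:24", pages=22)

print(f"pixel ratio: {ratio:.2f}")
print(f"throughput: {ppm:.2f} pages/minute")
```

This prints a pixel ratio of 2.25 and a throughput of 5.28 pages/minute,
which matches the "approximately 6 PPM" reported for that run.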

It seems unlikely that there have been huge performance changes in the last
six months, but you could try building from source to see if it makes a
difference. I'm using the latest 3.05 head sources from GitHub.

Tom

p.s. One caveat - I think faxtotiff, as distributed, is broken and I
haven't had a chance to contribute my fixes back upstream yet.
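
For reference, the three-step pipeline listed at the top might be scripted
roughly as follows. This is a sketch under several assumptions, not the
exact commands used: it assumes "faxtotiff" refers to libtiff's fax2tiff,
that poppler's pdfimages is run with -ccitt and writes img-NNN.ccitt plus a
matching img-NNN.params file holding fax2tiff options, and that every
embedded image is one page. The function names and the "img" prefix are
hypothetical. Given the caveat above about the distributed tool, the
fax2tiff step may not work unpatched.

```python
import shlex
import subprocess
from pathlib import Path


def fax2tiff_command(ccitt_path, params_text, tiff_path):
    """Build the argv for converting one raw CCITT image to TIFF.

    params_text is the content of the .params file that pdfimages is
    assumed to write next to each .ccitt file.
    """
    return ["fax2tiff", *shlex.split(params_text),
            "-o", str(tiff_path), str(ccitt_path)]


def tiffcp_command(tiff_paths, combined_path):
    """Build the argv for concatenating single-page TIFFs into one file."""
    return ["tiffcp", *map(str, tiff_paths), str(combined_path)]


def run_pipeline(pdf_path, work_dir, combined_name="combined.tif"):
    """Run all three steps; needs pdfimages, fax2tiff and tiffcp on PATH."""
    work = Path(work_dir)
    work.mkdir(parents=True, exist_ok=True)
    # Step 1: extract the embedded CCITT images (assumed naming: img-NNN.*).
    subprocess.run(
        ["pdfimages", "-ccitt", str(pdf_path), str(work / "img")], check=True
    )
    tiffs = []
    for ccitt in sorted(work.glob("img-*.ccitt")):
        params = ccitt.with_suffix(".params").read_text()
        tiff = ccitt.with_suffix(".tif")
        # Step 2: wrap the raw CCITT data in a TIFF container.
        subprocess.run(fax2tiff_command(ccitt, params, tiff), check=True)
        tiffs.append(tiff)
    combined = work / combined_name
    # Step 3: concatenate the pages into a single multi-page TIFF.
    subprocess.run(tiffcp_command(tiffs, combined), check=True)
    return combined
```

Calling run_pipeline("input.pdf", "work") would leave work/combined.tif,
which can then be inspected with tiffinfo or passed to tesseract. Because
the image data is only re-wrapped and never re-rendered, no scaling takes
place.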

On Tue, Feb 16, 2016 at 9:11 AM, viraf <[email protected]> wrote:

> I ran a test with a multipage TIFF, and am getting the same result of
> approximately 6 PPM.
> I used the following command to create the multipage TIFF:
>   gs -o multipage-tiffg4.tif -sDEVICE=tiffg4 -r300x300 /media/sf_shared/00473706.PDF
>
> and ran it under Windows and Linux.  Here is the Linux output:
>
> Tue Feb 16 08:55:14 EST 2016
> Tesseract Open Source OCR Engine v3.04.00 with Leptonica
> Page 1
> Page 2
> Page 3
> Page 4
> Page 5
> Page 6
> Page 7
> Page 8
> Page 9
> Page 10
> Page 11
> Page 12
> Page 13
> OSD: Weak margin (4.51) for 95 blob text block, but using orientation anyway: 0
> Page 14
> Page 15
> Page 16
> Page 17
> Page 18
> Page 19
> OSD: Weak margin (6.28) for 1715 blob text block, but using orientation anyway: 0
> Page 20
> OSD: Weak margin (2.15) for 1383 blob text block, but using orientation anyway: 0
> Page 21
> Page 22
> Tue Feb 16 08:59:24 EST 2016
>
> You had mentioned spending time on image processing, so I was wondering
> what the "OSD: Weak margin" messages mean.  The script used to OCR is
>
> date
> tesseract /media/sf_shared/multipage-tiffg4.tif out -l eng hocr
> date
>
> Any suggestions on where to investigate next would be appreciated.
>
> Thanks
>
> - viraf
>
>
> On Tuesday, February 16, 2016 at 8:17:53 AM UTC-5, viraf wrote:
>>
>> Thanks for the clarification.  I now know that 24 PPM on a single thread
>> should be achievable.  I'll update the post after trying a few options.
>> Thanks for your help.
>>
>> - viraf
>>
>> On Tuesday, February 16, 2016 at 1:53:40 AM UTC-5, Tom Morris wrote:
>>>
>>> On Mon, Feb 15, 2016 at 8:24 PM, viraf <[email protected]> wrote:
>>>
>>>> Tom, the images are TIFF (CCITT T.6) images - 2509 x 3530 @ 300 dpi (1
>>>> bit - i.e. BW). The language is English.
>>>>
>>>
>>> So, roughly the same resolution and format as I used, but only 1/4 the
>>> speed. My test machine calls itself a mid-2014 MBP with 2.5 GHz Intel Core
>>> i7 (and no, it's not using OpenCL, the GPU, or multiple threads).
>>>
>>>
>>>> I am using Tess4J 3.0, which includes Tesseract 3.04.  I am
>>>> instantiating a new Tesseract object for each page, however the cost was
>>>> minimal (74ms) for the total run.
>>>>
>>>
>>> I'm not familiar with the Tess4J wrapper, but that sounds pretty low for
>>> initialization cost. Are you sure you're measuring the true cost (i.e. you're
>>> not being fooled by lazy initialization)? What happens when you combine all
>>> the pages into a single multi-page TIFF and OCR it (so you can be sure
>>> you've amortized the initialization cost)?
>>>
>>>> When you state "taking a big hit on image processing" how would I be
>>>> able to isolate the issue to image processing?
>>>>
>>>
>>> I was mainly talking about operations like thresholding, format
>>> conversion, etc. to get to a usable image.  That's obviously not applicable
>>> if you're working with bitonal images (which you hadn't disclosed when I
>>> wrote my reply).
>>>
>> --
> You received this message because you are subscribed to a topic in the
> Google Groups "tesseract-ocr" group.
> To unsubscribe from this topic, visit
> https://groups.google.com/d/topic/tesseract-ocr/5CSIYkba5Dc/unsubscribe.
> To unsubscribe from this group and all its topics, send an email to
> [email protected].
> To post to this group, send email to [email protected].
> Visit this group at https://groups.google.com/group/tesseract-ocr.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/tesseract-ocr/a9b6dda7-740d-4d66-8b45-a632e9c8dc11%40googlegroups.com
> <https://groups.google.com/d/msgid/tesseract-ocr/a9b6dda7-740d-4d66-8b45-a632e9c8dc11%40googlegroups.com?utm_medium=email&utm_source=footer>
> .
>
> For more options, visit https://groups.google.com/d/optout.
>
