Re: [tesseract-ocr] OCRs produced by Tesseract differ wildly in size

Art Chimes Tue, 22 Mar 2022 14:23:28 -0700

I have uploaded the relevant files to the Internet Archive, where my
project is housed.


My previous post shortened the file names, as you will see.
In the shaded "DOWNLOAD OPTIONS" box, scroll down to "SHOW ALL" and
click to find the pdf and tif versions.

https://archive.org/details/issues-at-the-u.n.-general-assembly-voa-radio-script
https://archive.org/details/berlin-warnings-voa-radio-script

Thanks for any help you can provide,
Art in Northern Virginia (USA)

On Tue, Mar 22, 2022 at 1:39 AM Zdenko Podobny <[email protected]> wrote:
>
> Can you provide an example tif file?
>
> Zdenko
>
>
> On Mon, Mar 21, 2022 at 8:24 PM ArtmanDC <[email protected]> wrote:
>>
>> I am working on a project that involves turning text pages from scanned 
>> microfilm into searchable PDFs.
>>
>> My workflow is as follows:
>>
>> (1) Import raw scan images (*.tif) into ABBYY FineReader v. 12 Professional 
>> for some basic image editing, including split, deskew, rough crop, and some 
>> visual cleanup (e.g. microfilm dust). Export as multipage .tif. (Most 
>> documents are 2 or 3 pages; a small percentage are 7-8 pages.)
>> (2) Import edited images into IrfanView 4.58 for further editing, normally as 
>> follows:
>>    (a) auto crop borders (ctrl-shift-Y)
>>    (b) change canvas size (shift-V) using Method 1 to set top and left 
>> margins and then Method 2 to pad the right and bottom margins to achieve 
>> a standard starting corner and page size.
>>    (c) light editing to clean up any stray marks (copy/paste white background 
>> color to mask marks).
>>    (d) repeat as necessary for subsequent pages. NOTE: As far as I can tell, 
>> changes in multipage tif files have to be saved individually in IrfanView or 
>> changes will be lost when moving to another page.
>> (3) Run edited tif file through Tesseract v5.0.1.20220118 using this format 
>> on the Windows 10 command line:   tesseract input.tif input pdf --psm 4
>>
>> The resulting PDF files were as expected, except for the size relative to 
>> the input tif files.
>>
>> The input files were both two pages and approximately the same size: 3,296 
>> characters for 56143 and 3,194 for 56145.
>>
>> 56143.pdf   998k (2.7 times the size of the tif file)
>> 56143.tif   369k
>> 56145.pdf    94k (half the size of the tif file)
>> 56145.tif   206k
>>
>> I'm not terribly concerned about reducing the PDF file sizes, but I'm just 
>> baffled by why the PDF file size seems to have no relation to the input file 
>> size.
>>
>> I don't know if this is really a Tesseract issue, but since that is the 
>> software that actually generated the PDF, I thought this would be a good 
>> place to start.
>>
>> Thanks,
>> Art in Northern Virginia
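
As a quick sanity check on the figures quoted above, the PDF-to-TIFF ratios can be recomputed from the four sizes in the post. The sketch below is only arithmetic on those numbers; the dictionary layout and the function name size_ratio are illustrative and have nothing to do with Tesseract itself.

```python
# Sizes in kilobytes, copied from the quoted post.
sizes_kb = {
    "56143": {"tif": 369, "pdf": 998},
    "56145": {"tif": 206, "pdf": 94},
}


def size_ratio(pdf_kb, tif_kb):
    """Return the PDF size as a multiple of the TIFF size (hypothetical helper)."""
    return pdf_kb / tif_kb


# Ratio per document, rounded to two decimal places.
ratios = {
    name: round(size_ratio(s["pdf"], s["tif"]), 2)
    for name, s in sizes_kb.items()
}

for name, ratio in ratios.items():
    print(f"{name}: PDF is {ratio}x the size of the TIFF")
```

This gives roughly 2.7 for 56143 and 0.46 for 56145, which matches the "2.7 times" and "half" descriptions in the post.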

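Tesseract's PDF output embeds the page image along with the invisible text layer, so one thing that may be worth comparing is how each TIFF stores its pages (bit depth and compression). This is a guess at where to look, not a diagnosis. The sketch below reads those two tags from every page of a classic TIFF using only the Python standard library; the function name describe_tiff_pages and the dictionary keys are illustrative, and BigTIFF files are not handled.

```python
import struct

# TIFF tag numbers of interest (from the TIFF 6.0 baseline tag set).
TAG_NAMES = {258: "bits_per_sample", 259: "compression"}

# Common values of the Compression tag.
COMPRESSION_NAMES = {
    1: "none", 2: "CCITT RLE", 3: "CCITT Group 3", 4: "CCITT Group 4",
    5: "LZW", 6: "JPEG (old-style)", 7: "JPEG", 8: "Deflate",
    32773: "PackBits",
}

SHORT = 3  # TIFF field type for 16-bit unsigned values


def describe_tiff_pages(data):
    """Return one dict per page (IFD) with bit depth and compression."""
    byte_order = {b"II": "<", b"MM": ">"}.get(data[:2])
    if byte_order is None or struct.unpack_from(byte_order + "H", data, 2)[0] != 42:
        raise ValueError("not a classic TIFF file")

    pages = []
    seen = set()  # guards against a malformed, circular IFD chain
    offset = struct.unpack_from(byte_order + "I", data, 4)[0]
    while offset and offset not in seen:
        seen.add(offset)
        entry_count = struct.unpack_from(byte_order + "H", data, offset)[0]
        # Defaults that apply when a tag is absent.
        page = {"bits_per_sample": 1, "compression": 1}
        for i in range(entry_count):
            entry = offset + 2 + 12 * i
            tag, field_type, value_count = struct.unpack_from(
                byte_order + "HHI", data, entry)
            if tag not in TAG_NAMES or field_type != SHORT:
                continue
            if value_count * 2 <= 4:
                value_pos = entry + 8  # value stored inline in the entry
            else:
                # Value stored elsewhere; the entry holds its offset.
                value_pos = struct.unpack_from(byte_order + "I", data, entry + 8)[0]
            # Only the first value is kept (e.g. the first channel's bit depth).
            page[TAG_NAMES[tag]] = struct.unpack_from(
                byte_order + "H", data, value_pos)[0]
        page["compression_name"] = COMPRESSION_NAMES.get(page["compression"], "unknown")
        pages.append(page)
        # The offset of the next IFD follows the last entry; 0 ends the chain.
        offset = struct.unpack_from(
            byte_order + "I", data, offset + 2 + 12 * entry_count)[0]
    return pages
```

For example, describe_tiff_pages(open("56143.tif", "rb").read()) would return one dictionary per page; the file name here is the shortened one used in the post, so the actual downloaded name will differ.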
