Can you provide an example tif file?

Zdenko


On Mon, 21 Mar 2022 at 20:24, ArtmanDC <[email protected]> wrote:

> I am working on a project that involves turning text pages from scanned
> microfilm into searchable PDFs.
>
> My workflow is like this:
>
> (1) Import raw scan images (*.tif) into Abbyy FineReader v. 12
> Professional for some basic image editing including split, deskew, rough
> crop, and some visual cleanup e.g. microfilm dust. Export as multipage
> .tif. (Most documents are 2 or 3 pages; a small percentage are 7-8 pages.)
> (2) Import edited images to Irfanview 4.58 for further editing, normally
> as follows
>    (a) auto crop borders (ctrl-shift-Y)
>    (b) change canvas size (shift-V) using Method 1 to set top and left
> margins and then Method 2 to pad the right and bottom margins to achieve
> a standard starting corner and page size.
>    (c) light editing to clean up any stray marks (copy/paste white
> background color to mask marks).
>    (d) repeat as necessary for subsequent pages. NOTE: As far as I can
> tell, changes in multipage tif files have to be saved individually in
> IrfanView or changes will be lost when moving to another page.
> (3) Run the edited tif file through Tesseract v5.0.1.20220118 using this
> command on the Windows 10 command line:
>    tesseract input.tif input pdf --psm 4
>
> The resulting PDF files were as expected, except for the size relative to
> the input tif files.
>
> The input files were both two pages long and contained approximately the
> same amount of text: 3,296 characters for 56143 and 3,194 for 56145.
>
> 56143.pdf   998k (2.7 times the size of the tif file)
> 56143.tif   369k
> 56145.pdf    94k (half the size of the tif file)
> 56145.tif   206k
>
> I'm not terribly concerned about reducing the PDF file sizes, but I'm just
> baffled by why the PDF file size seems to have no relation to the input
> file size.
>
> I don't know if this is really a Tesseract issue, but since that is the
> software that actually generated the PDF I thought this is a good place to
> start.
>
> Thanks,
> Art in Northern Virginia
>
>
>
> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/tesseract-ocr/fe64cc77-a08e-4362-9dba-545532037108n%40googlegroups.com
> <https://groups.google.com/d/msgid/tesseract-ocr/fe64cc77-a08e-4362-9dba-545532037108n%40googlegroups.com?utm_medium=email&utm_source=footer>
> .
>
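
The size comparison quoted above can be reproduced for a whole folder of
outputs with a few lines of Python. This is only a minimal sketch, not part of
Tesseract: the function names size_ratio and compare are made up for
illustration, and it assumes each input.tif sits next to an input.pdf with the
same base name, as produced by the command in step (3). It uses nothing beyond
the standard library.

```python
import os


def size_ratio(pdf_path, tif_path):
    """Return the PDF size divided by the TIFF size (both in bytes)."""
    tif_size = os.path.getsize(tif_path)
    if tif_size == 0:
        raise ValueError(f"{tif_path} is empty")
    return os.path.getsize(pdf_path) / tif_size


def compare(directory, stems):
    """Map each base name (e.g. '56143') to its PDF/TIFF size ratio.

    Assumes <stem>.pdf and <stem>.tif both exist in `directory`.
    """
    return {
        stem: size_ratio(
            os.path.join(directory, stem + ".pdf"),
            os.path.join(directory, stem + ".tif"),
        )
        for stem in stems
    }
```

Calling compare(".", ["56143", "56145"]) in the folder holding the four files
returns the two ratios, which makes outliers easy to spot once there are more
than a handful of documents.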