Can you provide an example tif file?

Zdenko
On Mon, 21 Mar 2022 at 20:24, ArtmanDC <[email protected]> wrote:

> I am working on a project that involves turning text pages from scanned
> microfilm into searchable PDFs.
>
> My workflow is like this:
>
> (1) Import raw scan images (*.tif) into Abbyy FineReader v. 12
> Professional for some basic image editing including split, deskew, rough
> crop, and some visual cleanup, e.g. microfilm dust. Export as multipage
> .tif. (Most documents are 2 or 3 pages; a small percentage are 7-8 pages.)
> (2) Import edited images to IrfanView 4.58 for further editing, normally
> as follows:
>     (a) auto crop borders (ctrl-ctrl-Y)
>     (b) change canvas size (shift-V) using Method 1 to set top and left
>     margins and then Method 2 to pad the right and bottom margins to
>     achieve a standard starting corner and page size.
>     (c) light editing to clean up any stray marks (copy/paste white
>     background color to mask marks).
>     (d) repeat as necessary for subsequent pages. NOTE: As far as I can
>     tell, changes in multipage tif files have to be saved individually in
>     IrfanView or changes will be lost when moving to another page.
> (3) Run edited tif file through Tesseract v5.0.1.20220118 using this
> format on the Windows 10 command line:
>     tesseract input.tif input pdf --psm 4
>
> The resulting PDF files were as expected, except for the size relative to
> the input tif files.
>
> The input files were both two pages and approximately the same size: 3,296
> characters for 56143 and 3,194 for 56145.
>
> 56143.pdf 998k (2.7 times the size of the tif file)
> 56143.tif 369k
> 56145.pdf 94k (half the size of the tif file)
> 56145.tif 206k
>
> I'm not terribly concerned about reducing the PDF file sizes, but I'm just
> baffled by why the PDF file size seems to have no relation to the input
> file size.
>
> I don't know if this is really a Tesseract issue, but since that is the
> software that actually generated the PDF I thought this is a good place to
> start.
>
> Thanks,
> Art in Northern Virginia
>
> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/tesseract-ocr/fe64cc77-a08e-4362-9dba-545532037108n%40googlegroups.com
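
For reference, the size ratios quoted in the message can be checked with a few lines of Python. This is only a sketch: the figures are the ones reported in the message (not measured here), and the names `reported_kb` and `pdf_to_tif_ratio` are illustrative, not part of any tool mentioned in the thread.

```python
# Sizes in kilobytes, copied from the message above (not measured here).
reported_kb = {
    "56143": {"tif": 369, "pdf": 998},
    "56145": {"tif": 206, "pdf": 94},
}


def pdf_to_tif_ratio(sizes):
    """Return the PDF size divided by the TIFF size for one document."""
    return sizes["pdf"] / sizes["tif"]


# Ratio per document, rounded to two decimal places.
ratios = {
    name: round(pdf_to_tif_ratio(sizes), 2)
    for name, sizes in reported_kb.items()
}

for name, ratio in ratios.items():
    print(f"{name}: PDF is {ratio:.2f}x the size of the TIFF")
```

With the reported figures this gives about 2.70 for 56143 and about 0.46 for 56145, which matches the "2.7 times" and "half" descriptions. To run the same comparison on real files, the sizes could be read with `os.path.getsize` instead of being typed in.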

