>tiffinfo 19780916-backgrounder56143-issues_at_the_un_general_assembly-guerrini.tif
TIFF Directory at offset 0x264004 (40744)
  Image Width: 2805 Image Length: 3630
  Resolution: 330, 330 pixels/inch
  Bits/Sample: 8
  Compression Scheme: LZW
  Photometric Interpretation: RGB color
  Orientation: row 0 top, col 0 lhs
  Samples/Pixel: 3
  Rows/Strip: 907
  Planar Configuration: single image plane
  Page Number: 0-1
  Predictor: horizontal differencing 2 (0x2)
TIFF Directory at offset 0x378168 (5c538)
  Image Width: 2805 Image Length: 3630
  Resolution: 330, 330 pixels/inch
  Bits/Sample: 1
  Compression Scheme: LZW
  Photometric Interpretation: min-is-white
  Orientation: row 0 top, col 0 lhs
  Samples/Pixel: 1
  Rows/Strip: 907
  Planar Configuration: single image plane
  Page Number: 0-2

>tiffinfo 19780919-backgrounder56145-berlin_warnings-bill_marsh.tif
TIFF Directory at offset 0x108282 (1a6fa)
  Image Width: 2805 Image Length: 3630
  Resolution: 330, 330 pixels/inch
  Bits/Sample: 1
  Compression Scheme: LZW
  Photometric Interpretation: min-is-white
  Orientation: row 0 top, col 0 lhs
  Samples/Pixel: 1
  Rows/Strip: 907
  Planar Configuration: single image plane
  Page Number: 0-1
TIFF Directory at offset 0x211720 (33b08)
  Image Width: 2805 Image Length: 3630
  Resolution: 330, 330 pixels/inch
  Bits/Sample: 1
  Compression Scheme: LZW
  Photometric Interpretation: min-is-white
  Orientation: row 0 top, col 0 lhs
  Samples/Pixel: 1
  Rows/Strip: 907
  Planar Configuration: single image plane
  Page Number: 0-2

As you can see, the problem is the image format in the file
19780916-backgrounder56143-issues_at_the_un_general_assembly-guerrini.tif:
its first page is 8-bit RGB (Bits/Sample: 8, Samples/Pixel: 3), while every
other page in both files is 1-bit. If you convert the first page to
Bits/Sample: 1 (2-color mode), you get output similar to that of the second
file:

>ls -l 1978*
-rw-r--r-- 1 user 197121  378410 Mar 28 18:57 19780916-backgrounder56143-issues_at_the_un_general_assembly-guerrini.tif
-rw-r--r-- 1 user 197121 1021177 Mar 28 19:00 19780916-backgrounder56143-issues_at_the_un_general_assembly-guerrini.tif.pdf
-rw-r--r-- 1 user 197121  218066 Mar 28 19:10 19780916-backgrounder56143-issues_at_the_un_general_assembly-guerrini_.tif
-rw-r--r-- 1 user 197121   99990 Mar 28 19:11 19780916-backgrounder56143-issues_at_the_un_general_assembly-guerrini_.tif.pdf
-rw-r--r-- 1 user 197121  211962 Mar 28 18:57 19780919-backgrounder56145-berlin_warnings-bill_marsh.tif
-rw-r--r-- 1 user 197121   95886 Mar 28 19:00 19780919-backgrounder56145-berlin_warnings-bill_marsh.tif.pdf

>tiffinfo 19780916-backgrounder56143-issues_at_the_un_general_assembly-guerrini_.tif
TIFF Directory at offset 0x103678 (194fe)
  Image Width: 2805 Image Length: 3630
  Resolution: 330, 330 pixels/inch
  Bits/Sample: 1
  Compression Scheme: LZW
  Photometric Interpretation: min-is-white
  Orientation: row 0 top, col 0 lhs
  Samples/Pixel: 1
  Rows/Strip: 907
  Planar Configuration: single image plane
  Page Number: 0-1
TIFF Directory at offset 0x217824 (352e0)
  Image Width: 2805 Image Length: 3630
  Resolution: 330, 330 pixels/inch
  Bits/Sample: 1
  Compression Scheme: LZW
  Photometric Interpretation: min-is-white
  Orientation: row 0 top, col 0 lhs
  Samples/Pixel: 1
  Rows/Strip: 907
  Planar Configuration: single image plane
  Page Number: 0-2

Zdenko

On Tue, 22 Mar 2022 at 22:23, Art Chimes <[email protected]> wrote:

> I have uploaded the relevant files to the Internet Archive, where my
> project is housed.
>
> My previous post shortened the file names, as you will see.
> In the shaded "DOWNLOAD OPTIONS" box, scroll down to "SHOW ALL" and
> click to find the pdf and tif versions.
>
> https://archive.org/details/issues-at-the-u.n.-general-assembly-voa-radio-script
> https://archive.org/details/berlin-warnings-voa-radio-script
>
> Thanks for any help you can provide,
> Art in Northern Virginia (USA)
>
> On Tue, Mar 22, 2022 at 1:39 AM Zdenko Podobny <[email protected]> wrote:
> >
> > Can you provide an example tif file?
> >
> > Zdenko
> >
> >
> > On Mon, 21 Mar 2022 at 20:24, ArtmanDC <[email protected]> wrote:
> >>
> >> I am working on a project that involves turning text pages from
> >> scanned microfilm into searchable PDFs.
> >>
> >> My workflow is like this:
> >>
> >> (1) Import raw scan images (*.tif) into Abbyy FineReader v. 12
> >>     Professional for some basic image editing including split, deskew,
> >>     rough crop, and some visual cleanup, e.g. microfilm dust. Export as
> >>     multipage .tif. (Most documents are 2 or 3 pages; a small
> >>     percentage are 7-8 pages.)
> >> (2) Import edited images to Irfanview 4.58 for further editing,
> >>     normally as follows:
> >>     (a) auto crop borders (ctrl-ctrl-Y)
> >>     (b) change canvas size (shift-V) using Method 1 to set top and left
> >>         margins and then Method 2 to pad the right and bottom margins to
> >>         achieve standard starting corner and page size.
> >>     (c) light editing to clean up any stray marks (copy/paste white
> >>         background color to mask marks).
> >>     (d) repeat as necessary for subsequent pages. NOTE: As far as I can
> >>         tell, changes in multipage tif files have to be saved
> >>         individually in IrfanView or changes will be lost when moving
> >>         to another page.
> >> (3) Run edited tif file through Tesseract v5.0.1.20220118 using this
> >>     format on the Windows 10 command line:
> >>     tesseract input.tif input pdf --psm 4
> >>
> >> The resulting PDF files were as expected, except for the size relative
> >> to the input tif files.
> >>
> >> The input files were both two pages and approximately the same size:
> >> 3,296 characters for 56143 and 3,194 for 56145.
> >>
> >> 56143.pdf 998k (2.7 times the size of the tif file)
> >> 56143.tif 369k
> >> 56145.pdf 94k (half the size of the tif file)
> >> 56145.tif 206k
> >>
> >> I'm not terribly concerned about reducing the PDF file sizes, but I'm
> >> just baffled by why the PDF file size seems to have no relation to the
> >> input file size.
> >>
> >> I don't know if this is really a Tesseract issue, but since that is the
> >> software that actually generated the PDF I thought this is a good place
> >> to start.
> >>
> >> Thanks,
> >> Art in Northern Virginia
>

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/CAJbzG8xGZZ7t5B8BoL39i222mFoKG8g5mnQg_vFXaVNe249TXw%40mail.gmail.com.
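
A follow-up note on the diagnosis in this thread: the per-page fields that tiffinfo reports (Bits/Sample, Samples/Pixel) can also be read directly from the TIFF directories, which may help spot a stray RGB page before running Tesseract. What follows is only a minimal sketch in Python using the standard library. The names read_pages and raw_size are hypothetical helpers, not part of any tool mentioned here, and the parser assumes a classic (non-BigTIFF) file, reads only SHORT and LONG fields, and assumes all samples of a pixel share the same bit depth.

```python
import struct

# Tag numbers as defined in the TIFF 6.0 specification.
TAGS = {256: "width", 257: "length",
        258: "bits_per_sample", 277: "samples_per_pixel"}
FIELD = {3: ("H", 2), 4: ("I", 4)}  # field type 3 = SHORT, type 4 = LONG


def read_pages(data):
    """Return width, length, bit depth and sample count for each TIFF page."""
    endian = {b"II": "<", b"MM": ">"}[bytes(data[:2])]
    magic, offset = struct.unpack_from(endian + "HI", data, 2)
    if magic != 42:
        raise ValueError("not a classic TIFF file")
    pages, seen = [], set()
    while offset and offset not in seen:  # 'seen' guards against IFD loops
        seen.add(offset)
        (count,) = struct.unpack_from(endian + "H", data, offset)
        page = {"bits_per_sample": 1, "samples_per_pixel": 1}  # spec defaults
        for i in range(count):
            entry = offset + 2 + 12 * i
            tag, ftype, n = struct.unpack_from(endian + "HHI", data, entry)
            if tag not in TAGS or ftype not in FIELD:
                continue
            code, size = FIELD[ftype]
            pos = entry + 8  # value is stored inline if it fits in 4 bytes
            if n * size > 4:  # otherwise the field holds an offset to it
                (pos,) = struct.unpack_from(endian + "I", data, pos)
            # Only the first value is read (e.g. 8 from 8,8,8 for RGB).
            (page[TAGS[tag]],) = struct.unpack_from(endian + code, data, pos)
        pages.append(page)
        (offset,) = struct.unpack_from(endian + "I", data,
                                       offset + 2 + 12 * count)
    return pages


def raw_size(page):
    """Uncompressed pixel data size in bytes, with rows padded to whole bytes."""
    row_bits = page["width"] * page["bits_per_sample"] * page["samples_per_pixel"]
    return (row_bits + 7) // 8 * page["length"]


# Hypothetical usage:
#   with open("input.tif", "rb") as f:
#       for number, page in enumerate(read_pages(f.read()), start=1):
#           print(number, page, raw_size(page))
```

For the page dimensions listed in this thread (2805 x 3630), raw_size gives 30,546,450 bytes for an 8-bit RGB page and 1,274,130 bytes for a 1-bit page, roughly a 24-fold difference before compression. That is consistent with the much larger PDF for 56143, although the final PDF size also depends on how the image is compressed when it is embedded.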

