>tiffinfo 19780916-backgrounder56143-issues_at_the_un_general_assembly-guerrini.tif
TIFF Directory at offset 0x264004 (40744)
  Image Width: 2805 Image Length: 3630
  Resolution: 330, 330 pixels/inch
  Bits/Sample: 8
  Compression Scheme: LZW
  Photometric Interpretation: RGB color
  Orientation: row 0 top, col 0 lhs
  Samples/Pixel: 3
  Rows/Strip: 907
  Planar Configuration: single image plane
  Page Number: 0-1
  Predictor: horizontal differencing 2 (0x2)
TIFF Directory at offset 0x378168 (5c538)
  Image Width: 2805 Image Length: 3630
  Resolution: 330, 330 pixels/inch
  Bits/Sample: 1
  Compression Scheme: LZW
  Photometric Interpretation: min-is-white
  Orientation: row 0 top, col 0 lhs
  Samples/Pixel: 1
  Rows/Strip: 907
  Planar Configuration: single image plane
  Page Number: 0-2

>tiffinfo 19780919-backgrounder56145-berlin_warnings-bill_marsh.tif
TIFF Directory at offset 0x108282 (1a6fa)
  Image Width: 2805 Image Length: 3630
  Resolution: 330, 330 pixels/inch
  Bits/Sample: 1
  Compression Scheme: LZW
  Photometric Interpretation: min-is-white
  Orientation: row 0 top, col 0 lhs
  Samples/Pixel: 1
  Rows/Strip: 907
  Planar Configuration: single image plane
  Page Number: 0-1
TIFF Directory at offset 0x211720 (33b08)
  Image Width: 2805 Image Length: 3630
  Resolution: 330, 330 pixels/inch
  Bits/Sample: 1
  Compression Scheme: LZW
  Photometric Interpretation: min-is-white
  Orientation: row 0 top, col 0 lhs
  Samples/Pixel: 1
  Rows/Strip: 907
  Planar Configuration: single image plane
  Page Number: 0-2
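
The two fields that matter here (Bits/Sample and Samples/Pixel) can also be
read per page without tiffinfo. Below is a minimal sketch in Python using only
the standard library. The helper name read_pages is hypothetical, and the
sketch assumes a classic (non-BigTIFF) file and decodes only SHORT and LONG
fields, so it is not a replacement for tiffinfo.

```python
import struct

# Only the tags needed for this comparison (numbers from the TIFF 6.0 spec).
TAGS = {256: "width", 257: "length", 258: "bits_per_sample", 277: "samples_per_pixel"}
TYPE_FMT = {3: "H", 4: "I"}  # field types SHORT (16-bit) and LONG (32-bit)


def read_pages(data):
    """Return one dict per TIFF directory (page) found in the bytes `data`."""
    order = {b"II": "<", b"MM": ">"}[data[:2]]  # little- or big-endian file
    magic, offset = struct.unpack(order + "HI", data[2:8])
    if magic != 42:
        raise ValueError("not a classic TIFF file")
    pages = []
    while offset:  # an offset of 0 marks the end of the directory chain
        (count,) = struct.unpack_from(order + "H", data, offset)
        page = {}
        for i in range(count):
            pos = offset + 2 + 12 * i  # each directory entry is 12 bytes
            tag, typ, n = struct.unpack_from(order + "HHI", data, pos)
            if tag not in TAGS or typ not in TYPE_FMT:
                continue
            fmt = order + str(n) + TYPE_FMT[typ]
            if struct.calcsize(fmt) <= 4:
                value_pos = pos + 8  # values fit inside the entry itself
            else:
                # otherwise the entry holds an offset to the values
                (value_pos,) = struct.unpack_from(order + "I", data, pos + 8)
            values = struct.unpack_from(fmt, data, value_pos)
            if values:
                page[TAGS[tag]] = values[0]  # the first value is enough here
        pages.append(page)
        (offset,) = struct.unpack_from(order + "I", data, offset + 2 + 12 * count)
    return pages


# Usage (file name taken from this thread):
# with open("19780919-backgrounder56145-berlin_warnings-bill_marsh.tif", "rb") as f:
#     for number, page in enumerate(read_pages(f.read()), start=1):
#         print(number, page)
```

A page reporting bits_per_sample 8 and samples_per_pixel 3 is the RGB case;
bits_per_sample 1 with samples_per_pixel 1 is the black-and-white case.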

As you can see, the problem is the image format of the first page of
19780916-backgrounder56143-issues_at_the_un_general_assembly-guerrini.tif:
it is 8-bit RGB, while every other page in both files is 1-bit black and
white. If you convert that first page to Bits/Sample: 1 (2-color mode), the
resulting PDF has a similar size to the one from the second file:
>ls -l 1978*
-rw-r--r-- 1 user 197121  378410 Mar 28 18:57 19780916-backgrounder56143-issues_at_the_un_general_assembly-guerrini.tif
-rw-r--r-- 1 user 197121 1021177 Mar 28 19:00 19780916-backgrounder56143-issues_at_the_un_general_assembly-guerrini.tif.pdf
-rw-r--r-- 1 user 197121  218066 Mar 28 19:10 19780916-backgrounder56143-issues_at_the_un_general_assembly-guerrini_.tif
-rw-r--r-- 1 user 197121   99990 Mar 28 19:11 19780916-backgrounder56143-issues_at_the_un_general_assembly-guerrini_.tif.pdf
-rw-r--r-- 1 user 197121  211962 Mar 28 18:57 19780919-backgrounder56145-berlin_warnings-bill_marsh.tif
-rw-r--r-- 1 user 197121   95886 Mar 28 19:00 19780919-backgrounder56145-berlin_warnings-bill_marsh.tif.pdf
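
To put numbers on the listing above, the PDF-to-TIFF size ratios can be
computed from the byte counts that ls reports. This is only a small Python
sketch; the dictionary labels are illustrative, not file names.

```python
# (tif_bytes, pdf_bytes) copied from the ls -l listing
sizes = {
    "56143, first page RGB": (378410, 1021177),
    "56143, converted to 1-bit": (218066, 99990),
    "56145, already 1-bit": (211962, 95886),
}

# Ratio of PDF size to TIFF size for each input file
ratios = {label: pdf / tif for label, (tif, pdf) in sizes.items()}

for label, ratio in ratios.items():
    print(f"{label}: PDF is {ratio:.2f} times the TIFF size")
```

This gives roughly 2.70, 0.46 and 0.45, so once the RGB page is converted,
both files behave the same way.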

>tiffinfo 19780916-backgrounder56143-issues_at_the_un_general_assembly-guerrini_.tif
TIFF Directory at offset 0x103678 (194fe)
  Image Width: 2805 Image Length: 3630
  Resolution: 330, 330 pixels/inch
  Bits/Sample: 1
  Compression Scheme: LZW
  Photometric Interpretation: min-is-white
  Orientation: row 0 top, col 0 lhs
  Samples/Pixel: 1
  Rows/Strip: 907
  Planar Configuration: single image plane
  Page Number: 0-1
TIFF Directory at offset 0x217824 (352e0)
  Image Width: 2805 Image Length: 3630
  Resolution: 330, 330 pixels/inch
  Bits/Sample: 1
  Compression Scheme: LZW
  Photometric Interpretation: min-is-white
  Orientation: row 0 top, col 0 lhs
  Samples/Pixel: 1
  Rows/Strip: 907
  Planar Configuration: single image plane
  Page Number: 0-2
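
As a rough indication of why the bit depth matters so much, the uncompressed
size of one page can be worked out from the tiffinfo fields above. The sketch
below is plain arithmetic in Python; raw_page_bytes is a hypothetical helper
name, and it assumes each row is padded to a whole number of bytes.

```python
def raw_page_bytes(width, height, bits_per_sample, samples_per_pixel):
    """Uncompressed size of one page, with every row padded to a whole byte."""
    bits_per_row = width * bits_per_sample * samples_per_pixel
    bytes_per_row = (bits_per_row + 7) // 8  # round up to a full byte
    return bytes_per_row * height


# Values taken from the tiffinfo output: 2805 x 3630 pixels
rgb_page = raw_page_bytes(2805, 3630, bits_per_sample=8, samples_per_pixel=3)
bilevel_page = raw_page_bytes(2805, 3630, bits_per_sample=1, samples_per_pixel=1)

print(rgb_page, bilevel_page, rgb_page / bilevel_page)
```

That is about 30.5 MB of raw samples for the RGB page against about 1.3 MB
for a 1-bit page, a factor of roughly 24 before any compression. The final
PDF size also depends on how each page is encoded, so treat this only as an
order-of-magnitude explanation.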


Zdenko


On Tue, Mar 22, 2022 at 10:23 PM Art Chimes <[email protected]> wrote:

> I have uploaded the relevant files to the Internet Archive, where my
> project is housed.
>
> My previous post shortened the file names, as you will see.
> In the shaded "DOWNLOAD OPTIONS" box, scroll down to "SHOW ALL" and
> click to find the pdf and tif versions.
>
>
> https://archive.org/details/issues-at-the-u.n.-general-assembly-voa-radio-script
> https://archive.org/details/berlin-warnings-voa-radio-script
>
> Thanks for any help you can provide,
> Art in Northern Virginia (USA)
>
> On Tue, Mar 22, 2022 at 1:39 AM Zdenko Podobny <[email protected]> wrote:
> >
> > Can you provide an example tif file?
> >
> > Zdenko
> >
> >
> > On Mon, Mar 21, 2022 at 8:24 PM ArtmanDC <[email protected]> wrote:
> >>
> >> I am working on a project that involves turning text pages from
> >> scanned microfilm into searchable PDFs.
> >>
> >> My workflow is like this —
> >>
> >> (1) Import raw scan images (*.tif) into Abbyy FineReader v. 12
> >> Professional for some basic image editing including split, deskew,
> >> rough crop, and some visual cleanup, e.g. microfilm dust. Export as
> >> multipage .tif. (Most documents are 2 or 3 pages; a small percentage
> >> are 7-8 pages.)
> >> (2) Import edited images to Irfanview 4.58 for further editing,
> >> normally as follows:
> >>    (a) auto crop borders (ctrl-shift-Y)
> >>    (b) change canvas size (shift-V) using Method 1 to set top and left
> >>    margins and then Method 2 to pad the right and bottom margins to
> >>    achieve a standard starting corner and page size.
> >>    (c) light editing to clean up any stray marks (copy/paste white
> >>    background color to mask marks).
> >>    (d) repeat as necessary for subsequent pages. NOTE: As far as I can
> >>    tell, changes in multipage tif files have to be saved individually
> >>    in IrfanView or changes will be lost when moving to another page.
> >> (3) Run edited tif file through Tesseract v5.0.1.20220118 using this
> >> format on the Windows 10 command line:
> >>    tesseract input.tif input pdf --psm 4
> >>
> >> The resulting PDF files were as expected, except for the size
> >> relative to the input tif files.
> >>
> >> The input files were both two pages and approximately the same size:
> >> 3,296 characters for 56143 and 3,194 for 56145.
> >>
> >> 56143.pdf   998k (2.7 times the size of the tif file)
> >> 56143.tif   369k
> >> 56145.pdf    94k (half the size of the tif file)
> >> 56145.tif   206k
> >>
> >> I'm not terribly concerned about reducing the PDF file sizes, but I'm
> >> just baffled by why the PDF file size seems to have no relation to the
> >> input file size.
> >>
> >> I don't know if this is really a Tesseract issue, but since that is
> >> the software that actually generated the PDF I thought this is a good
> >> place to start.
> >>
> >> Thanks,
> >> Art in Northern Virginia
>
