[tesseract-ocr] [SOLVED] Using different resolutions for OCR-ing and background image (Was: New user's questions)

Dr Rainer Woitok Mon, 09 Mar 2020 07:43:57 -0700

Greetings,

On Friday, 2020-02-21 18:18:21 +0100, I myself wrote:


> ...
> after playing a while  with "tesseract"  and after having read plenty of
> manual pages  and documentation on the web  I still have some questions.
> I want to create a PDF file with an OCR layer, but:
> 
> 1. Some of my TIFF files created by "ScanTailor" have light text on dark
>    background,  and documentation says to manually invert such files be-
>    fore feeding them  to current "tesseract" versions.   But of course I
>    want the PDF file to contain the original document with light text on
>    dark background.
> 
> 2. According to the documentation  the TIFF file for OCR-ing should have
>    at least 300 dpi.   But for the background image within the final PDF
>    document I'd like to use a JP2 file with only 150 dpi and a high com-
>    pression rate.
> 
> So is it possible  to pass "tesseract"  a high quality image for OCR-ing
> and a lesser quality image for building the PDF file with?

Sadly, I didn't receive any answers.  Searching further, I eventually
found

   https://github.com/tesseract-ocr/tesseract/issues/660

which contains the developers' discussion leading to the new
configuration variable "textonly_pdf" (you'll need "tesseract" 4.*.* to
use it).  That page also contains examples showing how to use this
option with either "qpdf" or "pdftk".  However, in my own experience
"qpdf" only works if you do NOT resample the original TIFF files from
300 dpi to 150 dpi but merely convert them to JP2 with lossy
compression.  If you do resample and use "qpdf", your PDF viewer will
not correctly find the text belonging to the area you highlight with
the mouse.  With "pdftk" everything works as expected, because "pdftk"
detects the different widths and heights in pixels and rescales the
overlaid file accordingly.
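
To check whether the two PDFs being merged really have the same page
size, it may help to work out the expected size in PDF points
(1 pt = 1/72 inch) from an image's pixel dimensions and dpi.  What
follows is only a sketch: the helper name "px_to_pt" and the sample
pixel counts are made up for illustration, and whether a size mismatch
is really what confuses "qpdf" is an assumption, not something
established above.

```shell
# px_to_pt PIXELS DPI: print the length in PDF points (1 pt = 1/72 inch).
# Hypothetical helper for illustration; not part of any of the tools above.
px_to_pt() {
    awk -v px="$1" -v dpi="$2" 'BEGIN { printf "%.2f\n", px * 72 / dpi }'
}

px_to_pt 2480 300   # page width at the original 300 dpi
px_to_pt 1240 150   # after resampling, with 150 dpi recorded in the file
px_to_pt 1240 300   # after resampling, if the file still claimed 300 dpi
```

The first two calls give the same width, the third only half of it.
Comparing such numbers with the page size a tool like "pdfinfo" reports
for each PDF shows whether the two files agree.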

The code below assumes the current directory to be the ScanTailor
project's "out/" directory, containing one TIFF file for every page
scanned:

   neg=            # Empty by default, so an inherited $neg cannot interfere.
   # neg=-negate   # Uncomment in case of light text on dark background.

   for f in *.tif
   do stm=${f%.tif}

      # Create smaller background image:
      convert "$f" -resample 150/150 -quality 40 jp2:- |
      img2pdf -o "$stm-b.pdf" -

      # Use black/white and optionally inverted image for OCR-ing
      # ($neg is deliberately unquoted so that it vanishes when empty):
      convert "$f" -threshold 70% $neg tif:-             |
      tesseract - - -l deu --psm 1 -c textonly_pdf=1 pdf |
      pdftk "$stm-b.pdf" stamp - output "$stm-o.pdf"
   done

   # Concatenate the per-page PDFs and remove the intermediate files:
   pdftk *-o.pdf cat output output.pdf
   rm -f *-[bo].pdf
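
Since "textonly_pdf" needs "tesseract" 4 or later, a script like the
one above could start with a version guard.  A minimal sketch, assuming
the first line of the "tesseract --version" output looks like
"tesseract 4.1.1" (possibly with a "v" before the number); the function
name "tess_major" is made up here:

```shell
# tess_major: read "tesseract --version" output on stdin and print the
# major version number.  Assumes a first line such as "tesseract 4.1.1".
tess_major() {
    sed -n '1s/^tesseract v\{0,1\}\([0-9][0-9]*\).*/\1/p'
}

# If the pipeline fails (e.g. tesseract is missing), treat the version
# as unknown.
major=$(tesseract --version 2>&1 | tess_major) || major=

# Warn if tesseract is older than 4 or could not be found.
if [ "${major:-0}" -lt 4 ]; then
    echo "tesseract 4 or later is needed for textonly_pdf" >&2
fi
```

The guard only prints a warning; whether to abort at that point is left
to the surrounding script.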

One last word of warning though:  If you're using  "evince"  as your PDF
viewer,  you'll only see empty blue boxes  when you highlight text using
the mouse.  According to

   https://gitlab.freedesktop.org/poppler/poppler/-/merge_requests/280

this has been traced to a "poppler" problem that still seems to be
open.

Sincerely,
  Rainer
