Greetings,

On Friday, 2020-02-21 18:18:21 +0100, I myself wrote:
> ...
> after playing a while with "tesseract" and after having read plenty of
> manual pages and documentation on the web I still have some questions.
> I want to create a PDF file with an OCR layer, but:
>
> 1. Some of my TIFF files created by "ScanTailor" have light text on dark
>    background, and documentation says to manually invert such files
>    before feeding them to current "tesseract" versions. But of course I
>    want the PDF file to contain the original document with light text on
>    dark background.
>
> 2. According to the documentation the TIFF file for OCR-ing should have
>    at least 300 dpi. But for the background image within the final PDF
>    document I'd like to use a JP2 file with only 150 dpi and a high
>    compression rate.
>
> So is it possible to pass "tesseract" a high quality image for OCR-ing
> and a lesser quality image for building the PDF file with?

Sadly though, I didn't receive any answers. Searching further, I
eventually found

  https://github.com/tesseract-ocr/tesseract/issues/660

which contains the developers' discussion leading to the new
configuration variable "textonly_pdf" (you'll need "tesseract" 4.*.* to
use that). This web page also contains examples which explain how to use
this option with either "qpdf" or "pdftk". However, in my own experience
"qpdf" will only work if you do NOT resample the original TIFF files
from 300 dpi to 150 dpi but only convert them to JP2 applying lossy
compression. If you do resample and use "qpdf", your PDF viewer will not
correctly find the text associated with the area you highlight with the
mouse, whereas with "pdftk" everything works as expected because "pdftk"
will detect the different widths and heights in pixels and rescale the
overlaid file accordingly.

The code below assumes the current directory to be the ScanTailor
project's "out/" directory containing one TIFF file for every page
scanned:

  # neg=-negate    # Uncomment in case of light text on dark background.
  for f in *.tif
  do
    stm=${f%.tif}

    # Create smaller background image:
    convert "$f" -resample 150/150 -quality 40 jp2:- |
      img2pdf -o "$stm-b.pdf" -

    # Use black/white and optionally inverted image for OCR-ing
    # ($neg is deliberately left unquoted so that it vanishes when unset):
    convert "$f" -threshold 70% $neg tif:- |
      tesseract - - -l deu --psm 1 -c textonly_pdf=1 pdf |
      pdftk "$stm-b.pdf" stamp - output "$stm-o.pdf"
  done

  # Concatenate the per-page results and remove the intermediate files:
  pdftk *-o.pdf cat output output.pdf
  rm -f *-[bo].pdf

One last word of warning though: If you're using "evince" as your PDF
viewer, you'll only see empty blue boxes when you highlight text using
the mouse. According to

  https://gitlab.freedesktop.org/poppler/poppler/-/merge_requests/280

this has been traced to a "poppler" problem which still seems to be
open.

Sincerely,
  Rainer
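
A footnote on the resampling remark above: PDF page sizes are stored in
points (1/72 inch), so an image dimension in pixels maps to
pixels * 72 / dpi points. The following is only a minimal sketch of that
arithmetic. The helper name "px_to_pt" and the example page widths are
made up for illustration, and it is not a diagnosis of what "qpdf" does
internally. It merely shows how the text-only page and the background
page can end up with equal or different sizes.

```shell
#!/bin/sh
# px_to_pt PIXELS DPI
# Hypothetical helper (not part of any tool mentioned above):
# prints a length given in pixels at DPI as PDF points (1/72 inch).
px_to_pt() {
  awk -v px="$1" -v dpi="$2" 'BEGIN { printf "%.2f\n", px * 72 / dpi }'
}

# Assumed example: a page 2480 pixels wide at 300 dpi,
# resampled to 1240 pixels for the background image.
px_to_pt 2480 300   # OCR image at 300 dpi                      -> 595.20
px_to_pt 1240 150   # background, resolution recorded as 150    -> 595.20
px_to_pt 1240 300   # hypothetical: recorded resolution is 300  -> 297.60
```

As long as both files record a matching resolution, both pages have the
same size in points. In the hypothetical third case the background page
would be half as wide, and a tool that overlays pages without rescaling
would then place the text in the wrong spot. To compare real files,
"pdfinfo" from the poppler utilities reports each page's size in points.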

