[tesseract-ocr] Searchable PDF output with oversized font

Ryan Johnson Wed, 17 Sep 2014 12:17:38 -0700

Hi all,

I've been having problems with tesseract-ocr since upgrading to Ubuntu 14.04 LTS.
When I use either hOCR output or tesseract's built-in searchable PDF output,
the text layer comes out in an oversized font that fills the page too quickly
and does not follow the text in the image.

I scan the images as TIFFs at 300 dpi, then clean them up with ScanTailor,
which also outputs TIFFs at 300 dpi, with slightly altered dimensions.
After that I run the OCR. The text output is there, but it is not aligned
properly with the image: as stated above, the font is too large, so the
text is cut off before the end, and the missing text does not come up in
a search.

I'm using the stock tesseract package for Ubuntu 14.04. I tried following
the instructions to build the training packages, but the build errored out.

Version info:
tesseract --version
tesseract 3.03
leptonica-1.70
libgif 4.1.6(?) : libjpeg 8d : libpng 1.2.50 : libtiff 4.0.3 : zlib 1.2.8 : webp 0.4.0

Here is a sample of my script for the ocr process using the output from
ScanTailor:
#!/bin/bash
# Run OCR on multiple TIFF pages and create a new PDF with the
# extracted text in a hidden layer. Requires tesseract, hocr2pdf, gs.
# NOTE: hocr2pdf is no longer required as of tesseract-ocr 3.03
# Usage: ./makeit output.pdf

set -e
output="$1"
dir=`pwd`

# OCR each page individually and convert into PDF
for page in "$dir"/*page*.tif
do
base="${page%.tif}"
# tesseract "$page" "$base" -l isl hocr
tesseract "$page" "$base.pdf" -l isl # I have also tried adding -psm 4 here
# Tesseract now outputs searchable pdf on its own
# hocr2pdf -i "$page" -o "$base.pdf" < "$base.hocr"
done

# combine the pages into one PDF
gs -q -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -sOutputFile="$output" \
    "$dir"/*page*.pdf

If anybody could please point out any error I have made or provide a
solution to this problem I would be very grateful. I am trying to get a
copy of a document to a professor of mine, where the original electronic
version of the document was lost. Searchable text is a desirable attribute
of the final result for her.
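
In case it helps with reproducing this, the per-page naming used in the script above can be checked with a dry run. This is only a minimal sketch: `print_ocr_commands` is a hypothetical helper that is not part of my actual script, and the file names are made-up examples. It prints the tesseract command for each page rather than running it.

```shell
#!/bin/bash
# Dry run: print the tesseract command for each page instead of running it.
# print_ocr_commands is a hypothetical helper used only for illustration.
print_ocr_commands() {
    local page base
    for page in "$@"; do
        # Strip the .tif suffix, e.g. book_page_001.tif -> book_page_001
        base="${page%.tif}"
        # Mirror the command line used in the loop of the script above
        printf 'tesseract %s %s -l isl\n' "$page" "$base.pdf"
    done
}

# Example file names (assumed, not my real scans)
print_ocr_commands book_page_001.tif book_page_002.tif
```

Comparing the printed commands with the files that actually appear in the directory should show whether each page ends up where the final gs step expects it.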

Regards
