Hi all,

I'm having problems with tesseract-ocr since upgrading to Ubuntu 14.04 LTS. 
When I use either hocr or the internal tesseract output for searchable pdfs 
I get an oversized font that fills the page too quickly and does not follow 
the text in the image.

I scan the images as TIFFs at 300 dpi, then clean them up with ScanTailor, 
which also outputs TIFFs at 300 dpi, with slightly altered dimensions. After 
that I run the OCR. The text layer is there, but it is not aligned with the 
image: as stated above, the font is too large, so the text is cut off before 
the end of the page, and the missing text does not come up in a search.
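
As a sanity check on the resolution, here is a small sketch of the arithmetic 
I would expect to hold between pixel size and PDF page size. It assumes the 
usual 72 points per inch; px_to_pt is just a name I made up for illustration, 
and I have not confirmed that a resolution mismatch is what is actually 
happening here.

```shell
# px_to_pt is a hypothetical helper: pixels and dpi in, PDF points out.
# 1 inch = 72 points, so points = pixels * 72 / dpi (integer division).
px_to_pt() {
    echo $(( $1 * 72 / $2 ))
}

# Example: a page 2550 pixels wide at 300 dpi is 8.5 inches wide.
px_to_pt 2550 300   # prints 612
```

If the same 2550 pixels were interpreted at 72 dpi instead, the result would 
be 2550 points, so comparing the page size reported by a PDF viewer against 
this number might show whether the dpi value is being lost along the way.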

I'm using the stock tesseract package for Ubuntu 14.04. I tried following 
the instructions to build the training packages, but the build errored out.

Version info:
tesseract --version
tesseract 3.03
 leptonica-1.70
  libgif 4.1.6(?) : libjpeg 8d : libpng 1.2.50 : libtiff 4.0.3 : zlib 1.2.8 : webp 0.4.0

Here is a sample of my script for the ocr process using the output from 
ScanTailor:
#!/bin/bash
# Run OCR on multiple PDF files and create a new pdf with the
# extracted text in hidden layer. Requires tesseract, hocr2pdf, gs.
# NOTE: hocr2pdf is no longer required as of tesseract-ocr 3.03
# Usage: ./makeit output.pdf

set -e
output="$1"
dir=`pwd`

# OCR each page individually and convert into PDF
for page in "$dir"/*page*.tif
do
    base="${page%.tif}"
#    tesseract "$page" "$base" -l isl hocr
    tesseract "$page" "$base.pdf" -l isl     # I have also tried adding -psm 4 here
#    Tesseract now outputs searchable pdf on its own
#    hocr2pdf -i "$page" -o "$base.pdf" < "$base.hocr"
done

# combine the pages into one PDF
gs -q -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -sOutputFile="$output" \
    "$dir"/*page*.pdf
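
For comparison, here is a sketch of the per-page call written the way I 
understand the 3.03 command line: the output base without an extension, and 
PDF output selected with the "pdf" config name, the same way "hocr" is 
selected in the commented-out line. That reading of the syntax is an 
assumption on my part, and ocr_page and DRY_RUN are hypothetical names used 
only for illustration.

```shell
# Sketch only: print (or run) the tesseract call for one page.
# Assumption: "pdf" is given as a config name after the language option,
# and tesseract appends the .pdf extension to the output base itself.
ocr_page() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        # Dry run: show the command instead of executing it.
        echo "tesseract $1 ${1%.tif} -l isl pdf"
    else
        tesseract "$1" "${1%.tif}" -l isl pdf
    fi
}
```

Running `DRY_RUN=1 ocr_page scan-page001.tif` only prints the command, so the 
arguments can be checked before any OCR is done.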

If anybody could please point out any error I have made or provide a 
solution to this problem I would be very grateful. I am trying to get a 
copy of a document to a professor of mine, where the original electronic 
version of the document was lost. Searchable text is a desirable attribute 
of the final result for her.


Regards
