[tesseract-ocr] Re: Searchable PDF output with oversized font

Chris Sun, 23 Nov 2014 06:16:52 -0800

Hi Ryan,
I ran into the same problem. Have you solved it?

Best regards,
Chris


On Wednesday, September 17, 2014 7:26:02 PM UTC+2, Ryan Johnson wrote:
>
> Hi all,
>
> I'm having problems with tesseract-ocr since upgrading to Ubuntu 14.04 
> LTS. When I use either hocr or the internal tesseract output for searchable 
> pdfs I get an oversized font that fills the page too quickly and does not 
> follow the text in the image.
>
> I scan the images as tiffs at 300 dpi, then clean up the images using 
> ScanTailor which outputs it as a tiff at 300 dpi as well, dimensions 
> slightly altered. After that I perform the OCR. The output is there, but 
> the font is not aligned properly with the image: as stated above, the 
> font is too large, so the text is cut off before the end, and the missing 
> text does not come up in a search.
>
> I'm using the stock tesseract package for Ubuntu 14.04. I tried following 
> the instructions to build the training packages but it errored out.
>
> Version info:
> tesseract --version
> tesseract 3.03
>  leptonica-1.70
>   libgif 4.1.6(?) : libjpeg 8d : libpng 1.2.50 : libtiff 4.0.3 : zlib 1.2.8 : webp 0.4.0
>
> Here is a sample of my script for the ocr process using the output from 
> ScanTailor:
> #!/bin/bash
> # Run OCR on multiple TIFF page images and create a new PDF with the
> # extracted text in a hidden layer. Requires tesseract, hocr2pdf, gs.
> # NOTE: hocr2pdf is no longer required as of tesseract-ocr 3.03
> # Usage: ./makeit output.pdf
>
> set -e
> output="$1"
> dir=`pwd`
>
> # OCR each page individually and convert into PDF
> for page in "$dir"/*page*.tif
> do
>     base="${page%.tif}"
> #    tesseract "$page" "$base" -l isl hocr
>     # I have also tried adding -psm 4 here
>     tesseract "$page" "$base" -l isl pdf
> #    Tesseract now outputs searchable pdf on its own
> #    hocr2pdf -i "$page" -o "$base.pdf" < "$base.hocr"
> done
>
> # combine the pages into one PDF
> gs -q -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -sOutputFile="$output" \
>     "$dir"/*page*.pdf
>
> If anybody could please point out any error I have made or provide a 
> solution to this problem I would be very grateful. I am trying to get a 
> copy of a document to a professor of mine, where the original electronic 
> version of the document was lost. Searchable text is a desirable attribute 
> of the final result for her.
>
>
> Regards
>
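
For anyone reusing the quoted script, here is a slightly more defensive sketch of the per-page loop. It is not a fix for the oversized font, only a tidier way to drive the same step. It assumes tesseract 3.03 or later, where a trailing "pdf" config name selects PDF output and tesseract appends ".pdf" to the output base on its own. The helper name ocr_pages and its two arguments are hypothetical and do not appear in the original script.

```shell
#!/bin/bash
# Sketch only: ocr_pages is a hypothetical helper, not part of the original script.
# Assumes tesseract 3.03+, where the trailing "pdf" config selects PDF output
# and tesseract itself appends ".pdf" to the output base.

ocr_pages() {
    local dir="$1" lang="$2" page base found=0
    for page in "$dir"/*page*.tif; do
        # If nothing matches, the glob stays literal; skip it rather than
        # passing a nonexistent file name to tesseract.
        [ -e "$page" ] || continue
        found=1
        base="${page%.tif}"
        # Writes "$base.pdf" next to the source image.
        tesseract "$page" "$base" -l "$lang" pdf || return 1
    done
    if [ "$found" -eq 0 ]; then
        echo "no *page*.tif files in $dir" >&2
        return 1
    fi
}
```

Called as `ocr_pages "$(pwd)" isl`, this produces one PDF per page image, which the gs step in the quoted script can then merge. Returning a non-zero status when no pages are found keeps `set -e` useful: the script stops before gs is handed an unmatched glob.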

