Re: [tesseract-ocr] Re: Searchable PDF output with oversized font

ShreeDevi Kumar Sun, 23 Nov 2014 08:13:00 -0800

Have you tried a version compiled from the latest source on git?

If you post a couple of sample images, I can give it a try and let you know
what results I get.
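
In the meantime, one thing that may be worth ruling out (nothing in this
thread establishes the cause, so treat it as a guess) is the resolution
stored in the TIFFs that ScanTailor writes. A PDF page is measured in points
(1/72 inch), so a length in pixels maps to pixels * 72 / dpi points. The
sketch below only does that arithmetic. `px_to_pt` is a hypothetical helper
name, and the 2550 x 3300 pixel page is an assumed example (US Letter at
300 dpi), not taken from your files.

```shell
#!/bin/bash
# Hypothetical helper: convert a length in pixels to PDF points (1/72 inch)
# for a given resolution in dpi, rounded to the nearest whole point.
px_to_pt() {
    local px="$1" dpi="$2"
    # Integer arithmetic; adding dpi/2 before dividing rounds to nearest.
    echo $(( (px * 72 + dpi / 2) / dpi ))
}

# Assumed example page: 2550 x 3300 pixels scanned at 300 dpi.
px_to_pt 2550 300   # prints 612 (8.5 in)
px_to_pt 3300 300   # prints 792 (11 in)
```

If the resolution recorded in a file were something other than 300, the same
pixel dimensions would give a different page size. A tool such as tiffinfo
can show the resolution stored in a TIFF.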


ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Sun, Nov 23, 2014 at 5:00 PM, Chris <[email protected]> wrote:

> Hi Ryan,
> I ran into the same problem. Have you solved it?
>
> Best regards,
>
> Chris
>
>
> On Wednesday, September 17, 2014 7:26:02 PM UTC+2, Ryan Johnson wrote:
>>
>> Hi all,
>>
>> I'm having problems with tesseract-ocr since upgrading to Ubuntu 14.04
>> LTS. When I use either hocr or the internal tesseract output for searchable
>> pdfs I get an oversized font that fills the page too quickly and does not
>> follow the text in the image.
>>
>> I scan the images as tiffs at 300 dpi, then clean up the images using
>> ScanTailor which outputs it as a tiff at 300 dpi as well, dimensions
>> slightly altered. After that I perform the OCR. The output is there, but
>> the font is not aligned properly to the image. As stated above, the font
>> is too large, so the text is cut off before the end, and the missing
>> text does not come up in a search.
>>
>> I'm using the stock tesseract package for Ubuntu 14.04. I tried following
>> the instructions to build the training packages but it errored out.
>>
>> Version info:
>> tesseract --version
>> tesseract 3.03
>>  leptonica-1.70
>>   libgif 4.1.6(?) : libjpeg 8d : libpng 1.2.50 : libtiff 4.0.3 : zlib
>> 1.2.8 : webp 0.4.0
>>
>> Here is a sample of my script for the ocr process using the output from
>> ScanTailor:
>> #!/bin/bash
>> # Run OCR on multiple TIFF pages and create a new PDF with the
>> # extracted text in a hidden layer. Requires tesseract, hocr2pdf, gs.
>> # NOTE: hocr2pdf is no longer required as of tesseract-ocr 3.03
>> # Usage: ./makeit output.pdf
>>
>> set -e
>> output="$1"
>> dir="$(pwd)"
>>
>> # OCR each page individually and convert into PDF
>> for page in "$dir"/*page*.tif
>> do
>>     base="${page%.tif}"
>> #    tesseract "$page" "$base" -l isl hocr
>>     # The output base takes no extension; the trailing "pdf" config
>>     # selects PDF output. I have also tried adding -psm 4 here.
>>     tesseract "$page" "$base" -l isl pdf
>> #    Tesseract now outputs searchable pdf on its own
>> #    hocr2pdf -i "$page" -o "$base.pdf" < "$base.hocr"
>> done
>>
>> # combine the pages into one PDF
>> gs -q -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -sOutputFile="$output" \
>>     "$dir"/*page*.pdf
>>
>> If anybody could please point out any error I have made or provide a
>> solution to this problem I would be very grateful. I am trying to get a
>> copy of a document to a professor of mine, where the original electronic
>> version of the document was lost. Searchable text is a desirable attribute
>> of the final result for her.
>>
>>
>> Regards
>>
>  --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To post to this group, send email to [email protected].
> Visit this group at http://groups.google.com/group/tesseract-ocr.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/tesseract-ocr/3bd841a9-075c-4467-b37c-74024f7ecc5b%40googlegroups.com
> <https://groups.google.com/d/msgid/tesseract-ocr/3bd841a9-075c-4467-b37c-74024f7ecc5b%40googlegroups.com?utm_medium=email&utm_source=footer>
> .
> For more options, visit https://groups.google.com/d/optout.
>

