Re: [tesseract-ocr] Re: Searchable PDF output with oversized font

ShreeDevi Kumar Tue, 25 Nov 2014 09:45:18 -0800

Hi Chris,

I opened the PDFs in both Adobe Reader and Foxit Reader on Windows 7. The
page briefly flickers with oversized text but then displays normally, and
at 100% zoom the output also looks normal.


Tesseract now has a 'pdf' output option, so you don't need to run
'hocr2pdf'. Try the following:

 tesseract -l deu -psm 3 "$page" "$page" pdf

If you also need hOCR output, use:

 tesseract -l deu -psm 3 "$page" "$page" hocr pdf
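
For a whole batch of pages, the same command could be wrapped in a loop.
This is only a minimal sketch, assuming your files follow the
"<prefix>_out_*.tif" naming from your script. The function name ocr_pages
and the TESSERACT override variable are hypothetical names used for
illustration (the override is not a tesseract feature, it just allows a
dry run with echo).

```shell
#!/bin/bash
# Sketch only: OCR every "<prefix>_out_*.tif" page straight to a searchable PDF.
# TESSERACT is a hypothetical override (not part of tesseract) so the loop
# can be dry-run, e.g. TESSERACT=echo ocr_pages scan
ocr_pages() {
    local prefix="$1" page
    # Glob directly instead of parsing `ls`, so names with spaces survive.
    for page in "${prefix}"_out_*.tif; do
        [ -e "$page" ] || continue      # the glob matched nothing
        # Output base is given without extension; the 'pdf' config adds ".pdf".
        "${TESSERACT:-tesseract}" -l deu -psm 3 "$page" "${page%.tif}" pdf
    done
}

# Usage: ocr_pages scan    # processes scan_out_*.tif in the current directory
```

Unlike the commands above, this strips ".tif" from the output base, so each
page ends up as "name.pdf" rather than "name.tif.pdf". Adjust that to taste.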

I'll test later with the git version of tesseract and post the pdfs for you.



ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Tue, Nov 25, 2014 at 10:00 PM, Chris <[email protected]> wrote:

> Hi,
> No, I have only tried the Ubuntu version.
>
> Here are the samples:
>
> https://drive.google.com/file/d/0B2kkT1CBqTPCRE1veGtQT3NvSTg/view?usp=sharing
>
> for page in "$1"_out_*.tif; do
>     tesseract -l deu -psm 3 "$page" "$page" hocr
>     hocr2pdf -i "$page" -s -o "$page.pdf.bak" < "$page.hocr"
> #    rm -rf $page
> done
>
> pdftk "$1"_out_*.tif.pdf.bak cat output "$1.tmp.pdf"
>
> Thank you,
>
> Chris
>
>
>
> On Sunday, November 23, 2014 5:12:12 PM UTC+1, shree wrote:
>>
>> Have you tried a version compiled from the latest source on git?
>>
>> If you post a couple of sample images, I can give it a try and let you
>> know what results I get.
>>
>> ShreeDevi
>> ____________________________________________________________
>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>
>> On Sun, Nov 23, 2014 at 5:00 PM, Chris <[email protected]> wrote:
>>
>>> Hi Ryan,
>>> I ran into the same problem. Have you solved it?
>>>
>>> Best regards,
>>>
>>> Chris
>>>
>>>
>>> On Wednesday, September 17, 2014 7:26:02 PM UTC+2, Ryan Johnson wrote:
>>>>
>>>> Hi all,
>>>>
>>>> I'm having problems with tesseract-ocr since upgrading to Ubuntu 14.04
>>>> LTS. When I use either hocr or the internal tesseract output for searchable
>>>> pdfs I get an oversized font that fills the page too quickly and does not
>>>> follow the text in the image.
>>>>
>>>> I scan the images as tiffs at 300 dpi, then clean up the images using
>>>> ScanTailor which outputs it as a tiff at 300 dpi as well, dimensions
>>>> slightly altered. After that I perform the OCR. The output is there, but
>>>> the font is not aligned properly to the image. As stated above, the font
>>>> is too large, so the text is cut off before the end, and the missing
>>>> text does not come up in a search.
>>>>
>>>> I'm using the stock tesseract package for Ubuntu 14.04. I tried
>>>> following the instructions to build the training packages but it errored
>>>> out.
>>>>
>>>> Version info:
>>>> tesseract --version
>>>> tesseract 3.03
>>>>  leptonica-1.70
>>>>   libgif 4.1.6(?) : libjpeg 8d : libpng 1.2.50 : libtiff 4.0.3 : zlib 1.2.8 : webp 0.4.0
>>>>
>>>> Here is a sample of my script for the ocr process using the output from
>>>> ScanTailor:
>>>> #!/bin/bash
>>>> # Run OCR on multiple PDF files and create a new pdf with the
>>>> # extracted text in hidden layer. Requires tesseract, hocr2pdf, gs.
>>>> # NOTE: hocr2pdf is no longer required as of tesseract-ocr 3.03
>>>> # Usage: ./makeit output.pdf
>>>>
>>>> set -e
>>>> output="$1"
>>>> dir=`pwd`
>>>>
>>>> # OCR each page individually and convert into PDF
>>>> for page in "$dir"/*page*.tif
>>>> do
>>>>     base="${page%.tif}"
>>>> #    tesseract "$page" "$base" -l isl hocr
>>>>     tesseract "$page" "$base.pdf" -l isl     # I have also tried adding -psm 4 here
>>>> #    Tesseract now outputs searchable pdf on its own
>>>> #    hocr2pdf -i "$page" -o "$base.pdf" < "$base.hocr"
>>>> done
>>>>
>>>> # combine the pages into one PDF
>>>> gs -q -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -sOutputFile="$output" \
>>>>     "$dir"/*page*.pdf
>>>>
>>>> If anybody could please point out any error I have made or provide a
>>>> solution to this problem I would be very grateful. I am trying to get a
>>>> copy of a document to a professor of mine, where the original electronic
>>>> version of the document was lost. Searchable text is a desirable attribute
>>>> of the final result for her.
>>>>
>>>>
>>>> Regards
>>>>
>>>  --
>>> You received this message because you are subscribed to the Google
>>> Groups "tesseract-ocr" group.
>>> To unsubscribe from this group and stop receiving emails from it, send
>>> an email to [email protected].
>>> To post to this group, send email to [email protected].
>>> Visit this group at http://groups.google.com/group/tesseract-ocr.
>>> To view this discussion on the web visit
>>> https://groups.google.com/d/msgid/tesseract-ocr/3bd841a9-075c-4467-b37c-74024f7ecc5b%40googlegroups.com.
>>> For more options, visit https://groups.google.com/d/optout.
>>>
>>

