Hi Chris, I opened the PDFs in Adobe Reader as well as Foxit Reader on Windows 7. The page flickers briefly with oversized text but then seems to display normally; at 100% zoom the output also looks regular.
Tesseract now has a 'pdf' option, so you don't need to run 'hocr2pdf'. Try the following:

tesseract -l deu -psm 3 "$page" "$page" pdf

If you also need hocr, you can give the command as

tesseract -l deu -psm 3 "$page" "$page" hocr pdf

I'll test later with the git version of tesseract and post the PDFs for you.

ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Tue, Nov 25, 2014 at 10:00 PM, Chris <[email protected]> wrote:

> Hi,
> No, I have only tried with the Ubuntu version.
>
> Here are the samples:
>
> https://drive.google.com/file/d/0B2kkT1CBqTPCRE1veGtQT3NvSTg/view?usp=sharing
>
> for page in $(ls $1_out_*.tif); do
>     tesseract -l deu -psm 3 "$page" "$page" hocr
>     hocr2pdf -i "$page" -s -o "$page.pdf.bak" < "$page.hocr"
>     # rm -rf $page
> done
>
> pdftk $1_out_*.tif.pdf.bak cat output "$1.tmp.pdf"
>
> Thank you,
>
> Chris
>
> On Sunday, November 23, 2014 5:12:12 PM UTC+1, shree wrote:
>>
>> Have you tried with the version compiled from the latest source on git?
>>
>> If you post a couple of sample images I can give it a try and let you know
>> what results I get.
>>
>> ShreeDevi
>> ____________________________________________________________
>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>
>> On Sun, Nov 23, 2014 at 5:00 PM, Chris <[email protected]> wrote:
>>
>>> Hi Ryan,
>>> I ran into the same problem. Have you solved it?
>>>
>>> Best regards,
>>>
>>> Chris
>>>
>>> On Wednesday, September 17, 2014 7:26:02 PM UTC+2, Ryan Johnson wrote:
>>>>
>>>> Hi all,
>>>>
>>>> I'm having problems with tesseract-ocr since upgrading to Ubuntu 14.04
>>>> LTS. When I use either hocr or the internal tesseract output for
>>>> searchable PDFs, I get an oversized font that fills the page too quickly
>>>> and does not follow the text in the image.
>>>>
>>>> I scan the images as TIFFs at 300 dpi, then clean them up using
>>>> ScanTailor, which also outputs TIFFs at 300 dpi with slightly altered
>>>> dimensions. After that I perform the OCR. The output is there, but the
>>>> font is not aligned properly to the image: as stated above, the font is
>>>> too large, so the text is cut off before the end, and the missing text
>>>> does not come up in a search.
>>>>
>>>> I'm using the stock tesseract package for Ubuntu 14.04. I tried
>>>> following the instructions to build the training packages, but the build
>>>> errored out.
>>>>
>>>> Version info:
>>>> tesseract --version
>>>> tesseract 3.03
>>>>  leptonica-1.70
>>>>   libgif 4.1.6(?) : libjpeg 8d : libpng 1.2.50 : libtiff 4.0.3 : zlib 1.2.8 : webp 0.4.0
>>>>
>>>> Here is a sample of my script for the OCR process using the output from
>>>> ScanTailor:
>>>>
>>>> #!/bin/bash
>>>> # Run OCR on multiple PDF files and create a new pdf with the
>>>> # extracted text in hidden layer. Requires tesseract, hocr2pdf, gs.
>>>> # NOTE: hocr2pdf is no longer required as of tesseract-ocr 3.03
>>>> # Usage: ./makeit output.pdf
>>>>
>>>> set -e
>>>> output="$1"
>>>> dir=`pwd`
>>>>
>>>> # OCR each page individually and convert into PDF
>>>> for page in "$dir"/*page*.tif
>>>> do
>>>>     base="${page%.tif}"
>>>>     # tesseract "$page" "$base" -l isl hocr
>>>>     tesseract "$page" "$base.pdf" -l isl # I have also tried adding -psm 4 here
>>>>     # Tesseract now outputs searchable pdf on its own
>>>>     # hocr2pdf -i "$page" -o "$base.pdf" < "$base.hocr"
>>>> done
>>>>
>>>> # combine the pages into one PDF
>>>> gs -q -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -sOutputFile="$output" "$dir"/*page*.pdf
>>>>
>>>> If anybody could please point out any error I have made or provide a
>>>> solution to this problem I would be very grateful. I am trying to get a
>>>> copy of a document to a professor of mine, where the original electronic
>>>> version of the document was lost. Searchable text is a desirable
>>>> attribute of the final result for her.
>>>>
>>>> Regards

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
To post to this group, send email to [email protected].
Visit this group at http://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/CAG2NduVNOBNRrGDv24NyXJtYNXA70BDrTXGjXNZ-d-PAwFdPYA%40mail.gmail.com. For more options, visit https://groups.google.com/d/optout.
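
P.S. In case it is useful, here is a rough sketch of how the per-page loop from the thread might look with the 'pdf' option in place of hocr2pdf. It assumes tesseract 3.03 or later and the deu language data. The function name ocr_pages and the TESSERACT override variable are placeholders made up for this sketch, not part of tesseract, and it has not been run against the sample pages.

```shell
#!/bin/bash
# Sketch: OCR each page image straight to a searchable PDF, without hocr2pdf.
# Assumes tesseract 3.03+ (built-in 'pdf' output) and the deu language data.

# TESSERACT is a hypothetical override so the loop can be tried with a stub.
TESSERACT="${TESSERACT:-tesseract}"

# ocr_pages PREFIX: run tesseract on every PREFIX_out_*.tif in the current directory.
ocr_pages() {
    local prefix="$1" page
    # Use a glob instead of $(ls ...) so file names with spaces survive.
    for page in "${prefix}"_out_*.tif; do
        [ -e "$page" ] || continue   # the glob matched nothing
        # Output base is "$page", so the result should be "$page.pdf".
        # Put 'hocr' before 'pdf' if the hOCR file is wanted as well.
        "$TESSERACT" -l deu -psm 3 "$page" "$page" pdf
    done
}

# Run only when a prefix is given on the command line, e.g. ./ocr.sh scan
if [ -n "${1:-}" ]; then
    ocr_pages "$1"
fi
```

The per-page PDFs could then be merged as before, with the pdftk line pointed at "$1"_out_*.tif.pdf instead of the .pdf.bak files.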

