Hi, no, I have only tried the Ubuntu version. Here are the samples: https://drive.google.com/file/d/0B2kkT1CBqTPCRE1veGtQT3NvSTg/view?usp=sharing

for page in $(ls $1_out_*.tif); do
    tesseract -l deu -psm 3 "$page" "$page" hocr
    hocr2pdf -i "$page" -s -o "$page.pdf.bak" < "$page.hocr"
    # rm -rf $page
done

pdftk $1_out_*.tif.pdf.bak cat output "$1.tmp.pdf"

Thank you,
Chris

On Sunday, November 23, 2014 5:12:12 PM UTC+1, shree wrote:
>
> Have you tried with the version compiled from the latest source on git?
>
> If you post a couple of sample images I can give it a try and let you know
> what results I get.
>
> ShreeDevi
> ____________________________________________________________
> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>
> On Sun, Nov 23, 2014 at 5:00 PM, Chris <[email protected]> wrote:
>
>> Hi Ryan,
>> I have run into the same problem. Have you solved it?
>>
>> Best regards,
>>
>> Chris
>>
>>
>> On Wednesday, September 17, 2014 7:26:02 PM UTC+2, Ryan Johnson wrote:
>>>
>>> Hi all,
>>>
>>> I'm having problems with tesseract-ocr since upgrading to Ubuntu 14.04
>>> LTS. When I use either hocr or the internal tesseract output for
>>> searchable PDFs, I get an oversized font that fills the page too quickly
>>> and does not follow the text in the image.
>>>
>>> I scan the images as TIFFs at 300 dpi, then clean them up using
>>> ScanTailor, which also outputs TIFFs at 300 dpi, with slightly altered
>>> dimensions. After that I perform the OCR. The output is there, but the
>>> font is not aligned properly to the image. As stated above, the font is
>>> too large, so the text is cut off before the end, and the missing text
>>> does not come up in a search.
>>>
>>> I'm using the stock tesseract package for Ubuntu 14.04. I tried
>>> following the instructions to build the training packages, but it
>>> errored out.
>>>
>>> Version info:
>>> tesseract --version
>>> tesseract 3.03
>>>  leptonica-1.70
>>>   libgif 4.1.6(?) : libjpeg 8d : libpng 1.2.50 : libtiff 4.0.3 : zlib
>>> 1.2.8 : webp 0.4.0
>>>
>>> Here is a sample of my script for the OCR process using the output from
>>> ScanTailor:
>>>
>>> #!/bin/bash
>>> # Run OCR on multiple PDF files and create a new pdf with the
>>> # extracted text in hidden layer. Requires tesseract, hocr2pdf, gs.
>>> # NOTE: hocr2pdf is no longer required as of tesseract-ocr 3.03
>>> # Usage: ./makeit output.pdf
>>>
>>> set -e
>>> output="$1"
>>> dir=`pwd`
>>>
>>> # OCR each page individually and convert into PDF
>>> for page in "$dir"/*page*.tif
>>> do
>>>     base="${page%.tif}"
>>>     # tesseract "$page" "$base" -l isl hocr
>>>     tesseract "$page" "$base.pdf" -l isl # I have also tried adding -psm 4 here
>>>     # Tesseract now outputs searchable pdf on its own
>>>     # hocr2pdf -i "$page" -o "$base.pdf" < "$base.hocr"
>>> done
>>>
>>> # combine the pages into one PDF
>>> gs -q -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -sOutputFile="$output" "$dir"/*page*.pdf
>>>
>>> If anybody could please point out any error I have made or provide a
>>> solution to this problem, I would be very grateful. I am trying to get a
>>> copy of a document to a professor of mine, where the original electronic
>>> version of the document was lost. Searchable text is a desirable
>>> attribute of the final result for her.
>>>
>>>
>>> Regards
>>>
>> --
>> You received this message because you are subscribed to the Google Groups
>> "tesseract-ocr" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to [email protected].
>> To post to this group, send email to [email protected].
>> Visit this group at http://groups.google.com/group/tesseract-ocr.
>> To view this discussion on the web visit
>> https://groups.google.com/d/msgid/tesseract-ocr/3bd841a9-075c-4467-b37c-74024f7ecc5b%40googlegroups.com.
>> For more options, visit https://groups.google.com/d/optout.
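
One thing that may be worth checking, offered only as an assumption since the thread does not confirm the cause: a PDF page is normally sized from the image's pixel dimensions together with its stored resolution, so a TIFF whose resolution tag is missing or different from the intended 300 dpi could produce a page and text layer at the wrong scale. The sketch below shows the arithmetic and a way to inspect the tag. The helper names page_points and show_resolution are hypothetical, and show_resolution assumes tiffinfo from the libtiff tools is installed.

```shell
#!/bin/sh
# Hypothetical helper (not from the thread): length in PDF points for a
# given pixel count and resolution. 1 point = 1/72 inch.
page_points() {
    pixels=$1
    dpi=$2
    echo $(( pixels * 72 / dpi ))
}

# Hypothetical helper: print the resolution tags stored in a TIFF file.
# Assumes tiffinfo (libtiff tools) is available; defined here, not called.
show_resolution() {
    tiffinfo "$1" | grep -i 'resolution'
}

# A 2550 px wide page tagged as 300 dpi is 8.5 inches wide, i.e. 612 pt.
page_points 2550 300
# The same pixels read as 72 dpi would give a page 2550 pt wide.
page_points 2550 72
```

Running show_resolution on one of the ScanTailor output pages would show whether the 300 dpi setting actually made it into the file that tesseract reads.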

