Have you tried a version compiled from the latest source on git? If you post a couple of sample images, I can give it a try and let you know what results I get.

ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Sun, Nov 23, 2014 at 5:00 PM, Chris <[email protected]> wrote:

> Hi Ryan,
> I ran into the same problem. Have you solved it?
>
> Best regards,
>
> Chris
>
>
> On Wednesday, September 17, 2014 7:26:02 PM UTC+2, Ryan Johnson wrote:
>>
>> Hi all,
>>
>> I'm having problems with tesseract-ocr since upgrading to Ubuntu 14.04
>> LTS. When I use either hocr or the internal tesseract output for
>> searchable pdfs, I get an oversized font that fills the page too quickly
>> and does not follow the text in the image.
>>
>> I scan the images as tiffs at 300 dpi, then clean up the images using
>> ScanTailor, which also outputs tiffs at 300 dpi, with slightly altered
>> dimensions. After that I perform the ocr. The output is there, but the
>> font is not aligned properly to the image. As stated above, the font is
>> too large, so the text is cut off before the end, and the missing text
>> does not come up in a search.
>>
>> I'm using the stock tesseract package for Ubuntu 14.04. I tried following
>> the instructions to build the training packages, but it errored out.
>>
>> Version info:
>> tesseract --version
>> tesseract 3.03
>>  leptonica-1.70
>>   libgif 4.1.6(?) : libjpeg 8d : libpng 1.2.50 : libtiff 4.0.3 : zlib 1.2.8 : webp 0.4.0
>>
>> Here is a sample of my script for the ocr process using the output from
>> ScanTailor:
>>
>> #!/bin/bash
>> # Run OCR on multiple PDF files and create a new pdf with the
>> # extracted text in hidden layer. Requires tesseract, hocr2pdf, gs.
>> # NOTE: hocr2pdf is no longer required as of tesseract-ocr 3.03
>> # Usage: ./makeit output.pdf
>>
>> set -e
>> output="$1"
>> dir=`pwd`
>>
>> # OCR each page individually and convert into PDF
>> for page in "$dir"/*page*.tif
>> do
>>     base="${page%.tif}"
>>     # tesseract "$page" "$base" -l isl hocr
>>     tesseract "$page" "$base.pdf" -l isl # I have also tried adding -psm 4 here
>>     # Tesseract now outputs searchable pdf on its own
>>     # hocr2pdf -i "$page" -o "$base.pdf" < "$base.hocr"
>> done
>>
>> # combine the pages into one PDF
>> gs -q -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -sOutputFile="$output" "$dir"/*page*.pdf
>>
>> If anybody could please point out any error I have made or provide a
>> solution to this problem, I would be very grateful. I am trying to get a
>> copy of a document to a professor of mine, where the original electronic
>> version of the document was lost. Searchable text is a desirable attribute
>> of the final result for her.
>>
>>
>> Regards
>>

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
To post to this group, send email to [email protected].
Visit this group at http://groups.google.com/group/tesseract-ocr. To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/CAG2NduWNRrFnSh%2BEu4C704RWLqK3Ndr-nZBO6ibwy_qYxdYfPw%40mail.gmail.com. For more options, visit https://groups.google.com/d/optout.

