https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality
Rescaling to 300 dpi is also helpful.

ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Fri, Aug 25, 2017 at 5:44 PM, Clinton Graham <ctgra...@pitt.edu> wrote:

> Thanks for the suggestion. The 4.0 alpha does seem to be providing better
> results out of the box. I pulled the Windows installer:
> tesseract 4.00.00alpha
> leptonica-1.74.1
> libgif 4.1.6(?) : libjpeg 8d (libjpeg-turbo 1.5.0) : libpng 1.6.20 :
> libtiff 4.0.6 : zlib 1.2.8 : libwebp 0.4.3 : libopenjp2 2.1.0
>
> Enjoy,
>
> - Clinton Graham
> Systems Developer
> University of Pittsburgh | University Library System
> 412-383-1057
>
> On Friday, August 25, 2017 at 7:54:25 AM UTC-4, shree wrote:
>>
>> https://launchpad.net/~alex-p/+archive/ubuntu/tesseract-ocr
>>
>> For the ppa
>>
>> On 25-Aug-2017 12:45 AM, "ShreeDevi Kumar" <shree...@gmail.com> wrote:
>>
>>> There is an unofficial ppa package available with latest code, if you do
>>> not want to build it.
>>>
>>> -- Excuse the brevity, msg sent from phone.
>>>
>>> On 25-Aug-2017 12:41 AM, "ShreeDevi Kumar" <shree...@gmail.com> wrote:
>>>
>>>> You can try building latest GitHub source for 4.0alpha and test with
>>>> the best/eng.traineddata from the tessdata repository.
>>>>
>>>> -- Excuse the brevity, msg sent from phone.
>>>>
>>>> On 25-Aug-2017 12:36 AM, "Clinton Graham" <ctgr...@pitt.edu> wrote:
>>>>
>>>>> Do you have any simple suggestions for improving OCR quality where
>>>>> tesseract is missing single character words like "a" and "I"?
>>>>>
>>>>> I'm using the default packages available in Ubuntu:
>>>>> tesseract 3.03
>>>>> leptonica-1.70
>>>>> libgif 4.1.6(?) : libjpeg 8d : libpng 1.2.50 : libtiff 4.0.3 :
>>>>> zlib 1.2.8 : webp 0.4.0
>>>>>
>>>>> I've also tried updating Ubuntu, building later 3.x sources:
>>>>> tesseract 3.05.01
>>>>> leptonica-1.74.4
>>>>> libjpeg 8d (libjpeg-turbo 1.3.0) : libpng 1.2.50 : libtiff 4.0.3 :
>>>>> zlib 1.2.8
>>>>>
>>>>> I'm using a command line run of simply:
>>>>> tesseract -psm 1 -l eng $f $f pdf
>>>>>
>>>>> I've also tried -psm 6 based on another forum post (though some of my
>>>>> input will be multicolumn).
>>>>>
>>>>> In whatever case, the first paragraph of my TIFF (attached) is
>>>>> consistently read without instances of single character words:
>>>>>
>>>>>> Honors Award {Presentation to Robert H. Ivy, M.D., D.D.S., Sc.D.,
>>>>>> F_‘.A.C.S. At the business meeting .of the American Cleft Palate
>>>>>> Association on May 6, 1961 in Montreal, Canada, an Honors and Awards
>>>>>> Committee was established and its duties were set forth. The Executive
>>>>>> Committee then selected Dr. Robert Ivy to be the first recipient of an
>>>>>> Honors Award. An HOnors and Awards Committee was then selected by the
>>>>>> President; serve as the current chairman. It therefore becomes personal
>>>>>> honor and privilege to me to be able to present this first award to good
>>>>>> friend. Dr. Ivy has had long and brilliant career in the field of plastic
>>>>>> surgery with particular interest in the cleft lip and palate patient. It
>>>>>> will be possible for us to mention only very few of Dr. Ivy’s many
>>>>>> accomplishments in our allotted time here today. would, therefore, like
>>>>>> to recommend to you two publications which will give you more insight
>>>>>> into the life of our honored guest.
>>>>>
>>>>> I'm hoping this sample and description is also representative of other
>>>>> dropped characters, such as single numerals in pagination and single
>>>>> initials in some instances.
>>>>>
>>>>> Unfortunately, I don't have a lot of time to devote to this project,
>>>>> so anything easy and obvious which I'm missing?
>>>>>
>>>>> Thanks,
>>>>>
>>>>> - Clinton Graham
>>>>> Systems Developer
>>>>> University of Pittsburgh | University Library System
>>>>> 412-383-1057

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-ocr+unsubscr...@googlegroups.com.
To post to this group, send email to tesseract-ocr@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/CAG2NduU1YgUDPjmZJcdydmtbvoiF1zM0uzBW7DBrC6zHD33qBg%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.
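
As a rough illustration of the 300 dpi suggestion at the top of this thread, the sketch below works out the pixel dimensions a scan would need after resampling. It is only arithmetic; the function name `rescale_size` and the example page size are hypothetical, and the thread does not say which tool was used for the actual resampling.

```python
TARGET_DPI = 300  # resolution suggested in the thread


def rescale_size(width_px, height_px, source_dpi, target_dpi=TARGET_DPI):
    """Return (new_width, new_height, factor) for resampling to target_dpi.

    Hypothetical helper: it only computes the target size and does not
    touch any image data.
    """
    if source_dpi <= 0:
        raise ValueError("source_dpi must be positive")
    factor = target_dpi / source_dpi
    return round(width_px * factor), round(height_px * factor), factor


# Assumed example: a page scanned at 200 dpi, 1700 x 2200 pixels.
new_w, new_h, factor = rescale_size(1700, 2200, 200)
print(new_w, new_h, factor)
```

The resulting size could then be handed to whatever image tool is already in use for the resampling step before running tesseract on the result.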