Thanks for the suggestion. The 4.0 alpha does seem to be providing better results out of the box. I pulled the Windows installer:

tesseract 4.00.00alpha
 leptonica-1.74.1
  libgif 4.1.6(?) : libjpeg 8d (libjpeg-turbo 1.5.0) : libpng 1.6.20 : libtiff 4.0.6 : zlib 1.2.8 : libwebp 0.4.3 : libopenjp2 2.1.0
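For anyone wanting to repeat the comparison, here is a minimal sketch of how the equivalent command line could be assembled for a 4.x build. It assumes a build that accepts the `--psm` and `--tessdata-dir` options (older 3.x builds use `-psm`) and a local directory holding the best/eng.traineddata file. The function name `build_tesseract_cmd`, the directory `tessdata/best`, and the file `page.tif` are hypothetical placeholders, not anything shipped with the installer.

```python
import shlex
from pathlib import Path


def build_tesseract_cmd(image, tessdata_dir, lang="eng", psm=1):
    """Build an argv list for a searchable-PDF run of tesseract.

    Assumption: the installed build accepts --psm and --tessdata-dir.
    """
    image = Path(image)
    return [
        "tesseract",
        str(image),                 # input image
        str(image),                 # output base; tesseract appends ".pdf"
        "--tessdata-dir", str(tessdata_dir),  # hypothetical model directory
        "-l", lang,
        "--psm", str(psm),          # 1 = automatic page segmentation with OSD
        "pdf",                      # config name that selects PDF output
    ]


if __name__ == "__main__":
    cmd = build_tesseract_cmd("page.tif", "tessdata/best")
    # Print the command for inspection; pass `cmd` to subprocess.run()
    # to actually execute it on a machine with tesseract installed.
    print(shlex.join(cmd))
```

Building the argument list separately from running it makes it easy to swap the page segmentation mode (for example 6 instead of 1) or the model directory and compare the output on the same TIFF.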
Enjoy,

- Clinton Graham
Systems Developer
University of Pittsburgh | University Library System
412-383-1057

On Friday, August 25, 2017 at 7:54:25 AM UTC-4, shree wrote:
>
> https://launchpad.net/~alex-p/+archive/ubuntu/tesseract-ocr
>
> For the ppa
>
> On 25-Aug-2017 12:45 AM, "ShreeDevi Kumar" <[email protected]> wrote:
>
>> There is an unofficial ppa package available with latest code, if you do
>> not want to build it.
>>
>> -- Excuse the brevity, msg sent from phone.
>>
>> On 25-Aug-2017 12:41 AM, "ShreeDevi Kumar" <[email protected]> wrote:
>>
>>> You can try building latest GitHub source for 4.0alpha and test with the
>>> best/eng.traineddata from the tessdata repository.
>>>
>>> -- Excuse the brevity, msg sent from phone.
>>>
>>> On 25-Aug-2017 12:36 AM, "Clinton Graham" <[email protected]> wrote:
>>>
>>>> Do you have any simple suggestions for improving OCR quality where
>>>> tesseract is missing single character words like "a" and "I"?
>>>>
>>>> I'm using the default packages available in Ubuntu:
>>>> tesseract 3.03
>>>>  leptonica-1.70
>>>>   libgif 4.1.6(?) : libjpeg 8d : libpng 1.2.50 : libtiff 4.0.3 : zlib 1.2.8 : webp 0.4.0
>>>>
>>>> I've also tried updating Ubuntu, building later 3.x sources:
>>>> tesseract 3.05.01
>>>>  leptonica-1.74.4
>>>>   libjpeg 8d (libjpeg-turbo 1.3.0) : libpng 1.2.50 : libtiff 4.0.3 : zlib 1.2.8
>>>>
>>>> I'm using a command line run of simply:
>>>> tesseract -psm 1 -l eng $f $f pdf
>>>>
>>>> I've also tried -psm 6 based on another forum post (though some of my
>>>> input will be multicolumn).
>>>>
>>>> In whatever case, the first paragraph of the my TIFF (attached) is
>>>> consistently read without instances of single character words:
>>>>
>>>>> Honors Award {Presentation to Robert H. Ivy, M.D., D.D.S., Sc.D.,
>>>>> F_‘.A.C.S.
>>>>> At the business meeting .of the American Cleft Palate
>>>>> Association on May 6, 1961 in Montreal, Canada, an Honors and Awards
>>>>> Committee was established and its duties were set forth. The Executive
>>>>> Committee then selected Dr. Robert Ivy to be the first recipient of an
>>>>> Honors Award. An HOnors and Awards Committee was then selected by the
>>>>> President; serve as the current chairman. It therefore becomes personal
>>>>> honor and privilege to me to be able to present this first award to good
>>>>> friend. Dr. Ivy has had long and brilliant career in the field of plastic
>>>>> surgery with particular interest in the cleft lip and palate patient. It
>>>>> will be possible for us to mention only very few of Dr. Ivy’s many
>>>>> accomplishments in our allotted time here today. would, therefore, like to
>>>>> recommend to you two publications which will give you more insight into the
>>>>> life of our honored guest.
>>>>
>>>> I'm hoping this sample and description is also representative of other
>>>> dropped characters, such as single numerals in pagination and single
>>>> initials in some instances.
>>>>
>>>> Unfortunately, I don't have a lot of time to devote to this project, so
>>>> anything easy and obvious which I'm missing?
>>>>
>>>> Thanks,
>>>>
>>>> - Clinton Graham
>>>> Systems Developer
>>>> University of Pittsburgh | University Library System
>>>> 412-383-1057
>>>>
>>>> --
>>>> You received this message because you are subscribed to the Google
>>>> Groups "tesseract-ocr" group.
>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>> an email to [email protected].
>>>> To post to this group, send email to [email protected].
>>>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>>>> To view this discussion on the web visit
>>>> https://groups.google.com/d/msgid/tesseract-ocr/e0b62d2b-2e27-4732-b4fe-8d5b78c52d98%40googlegroups.com.
>>>> For more options, visit https://groups.google.com/d/optout.

