There is an unofficial PPA package available with the latest code, if you do not want to build it yourself.
-- Excuse the brevity, msg sent from phone.

On 25-Aug-2017 12:41 AM, "ShreeDevi Kumar" <[email protected]> wrote:

> You can try building latest GitHub source for 4.0alpha and test with the
> best/eng.traineddata from the tessdata repository.
>
> -- Excuse the brevity, msg sent from phone.
>
> On 25-Aug-2017 12:36 AM, "Clinton Graham" <[email protected]> wrote:
>
>> Do you have any simple suggestions for improving OCR quality where
>> tesseract is missing single character words like "a" and "I"?
>>
>> I'm using the default packages available in Ubuntu:
>> tesseract 3.03
>> leptonica-1.70
>> libgif 4.1.6(?) : libjpeg 8d : libpng 1.2.50 : libtiff 4.0.3 : zlib 1.2.8 : webp 0.4.0
>>
>> I've also tried updating Ubuntu, building later 3.x sources:
>> tesseract 3.05.01
>> leptonica-1.74.4
>> libjpeg 8d (libjpeg-turbo 1.3.0) : libpng 1.2.50 : libtiff 4.0.3 : zlib 1.2.8
>>
>> I'm using a command line run of simply:
>> tesseract -psm 1 -l eng $f $f pdf
>>
>> I've also tried -psm 6 based on another forum post (though some of my
>> input will be multicolumn).
>>
>> In whatever case, the first paragraph of the my TIFF (attached) is
>> consistently read without instances of single character words:
>>
>> Honors Award {Presentation to Robert H. Ivy, M.D., D.D.S., Sc.D.,
>>> F_‘.A.C.S. At the business meeting .of the American Cleft Palate
>>> Association on May 6, 1961 in Montreal, Canada, an Honors and Awards
>>> Committee was established and its duties were set forth. The Executive
>>> Committee then selected Dr. Robert Ivy to be the first recipient of an
>>> Honors Award. An HOnors and Awards Committee was then selected by the
>>> President; serve as the current chairman. It therefore becomes personal
>>> honor and privilege to me to be able to present this first award to good
>>> friend. Dr. Ivy has had long and brilliant career in the field of plastic
>>> surgery with particular interest in the cleft lip and palate patient. It
>>> will be possible for us to mention only very few of Dr. Ivy’s many
>>> accomplishments in our allotted time here today. would, therefore, like to
>>> recommend to you two publications which will give you more insight into the
>>> life of our honored guest.
>>
>> I'm hoping this sample and description is also representative of other
>> dropped characters, such as single numerals in pagination and single
>> initials in some instances.
>>
>> Unfortunately, I don't have a lot of time to devote to this project, so
>> anything easy and obvious which I'm missing?
>>
>> Thanks,
>>
>> - Clinton Graham
>>
>> Systems Developer
>>
>> University of Pittsburgh | University Library System
>>
>> 412-383-1057

