Thanks for the suggestion.  The 4.0 alpha does seem to be providing better 
results out of the box.  I pulled the Windows installer:
tesseract 4.00.00alpha
 leptonica-1.74.1
  libgif 4.1.6(?) : libjpeg 8d (libjpeg-turbo 1.5.0) : libpng 1.6.20 : libtiff 4.0.6 : zlib 1.2.8 : libwebp 0.4.3 : libopenjp2 2.1.0
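
For anyone repeating the comparison, here is a minimal sketch of the same run 
from my original message, written with the documented argument order (image, 
output base, options, config) and the double-dash --psm spelling used in the 
4.0 usage text. The function name ocr_to_pdf and the TESSERACT override 
variable are made up for illustration; they are not part of tesseract.

```shell
# Hypothetical override: set TESSERACT=echo to print the arguments
# instead of running OCR. Defaults to the tesseract binary on PATH.
TESSERACT=${TESSERACT:-tesseract}

# Hypothetical helper: OCR one image and write a searchable PDF next to it.
#   $1  input image; the output base is the same name, so the result is "$1.pdf"
ocr_to_pdf() {
    # --psm 1: automatic page segmentation with orientation/script detection
    # -l eng:  English language data
    # pdf:     config that selects PDF output
    "$TESSERACT" "$1" "$1" --psm 1 -l eng pdf
}

# Example (commented out so that sourcing this file runs nothing):
# for f in *.tif; do ocr_to_pdf "$f"; done
```

Setting TESSERACT=echo first is a cheap way to check the command line before 
pointing it at a directory of TIFFs.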

Enjoy,


- Clinton Graham
Systems Developer
University of Pittsburgh | University Library System
412-383-1057

On Friday, August 25, 2017 at 7:54:25 AM UTC-4, shree wrote:
>
> https://launchpad.net/~alex-p/+archive/ubuntu/tesseract-ocr
>
> For the ppa
>
> On 25-Aug-2017 12:45 AM, "ShreeDevi Kumar" <[email protected]> wrote:
>
>> There is an unofficial ppa package available with latest code, if you do 
>> not want to build it.
>>
>> -- Excuse the brevity, msg sent from phone.
>>
>> On 25-Aug-2017 12:41 AM, "ShreeDevi Kumar" <[email protected]> wrote:
>>
>>> You can try building latest GitHub source for 4.0alpha and test with the 
>>> best/eng.traineddata from the tessdata repository.
>>>
>>> -- Excuse the brevity, msg sent from phone.
>>>
>>> On 25-Aug-2017 12:36 AM, "Clinton Graham" <[email protected]> wrote:
>>>
>>>> Do you have any simple suggestions for improving OCR quality where 
>>>> tesseract is missing single character words like "a" and "I"?
>>>>
>>>> I'm using the default packages available in Ubuntu:
>>>> tesseract 3.03
>>>>  leptonica-1.70
>>>>   libgif 4.1.6(?) : libjpeg 8d : libpng 1.2.50 : libtiff 4.0.3 : zlib 
>>>> 1.2.8 : webp 0.4.0
>>>>
>>>> I've also tried updating Ubuntu, building later 3.x sources:
>>>> tesseract 3.05.01
>>>>  leptonica-1.74.4
>>>>   libjpeg 8d (libjpeg-turbo 1.3.0) : libpng 1.2.50 : libtiff 4.0.3 : 
>>>> zlib 1.2.8
>>>>
>>>> I'm using a command line run of simply:
>>>> tesseract -psm 1 -l eng $f $f pdf
>>>>
>>>> I've also tried -psm 6 based on another forum post (though some of my 
>>>> input will be multicolumn).
>>>>
>>>> In whatever case, the first paragraph of my TIFF (attached) is 
>>>> consistently read without instances of single character words:
>>>>
>>>> Honors Award {Presentation to Robert H. Ivy, M.D., D.D.S., Sc.D., 
>>>>> F_‘.A.C.S. At the business meeting .of the American Cleft Palate 
>>>>> Association on May 6, 1961 in Montreal, Canada, an Honors and Awards 
>>>>> Committee was established and its duties were set forth. The Executive 
>>>>> Committee then selected Dr. Robert Ivy to be the first recipient of an 
>>>>> Honors Award. An HOnors and Awards Committee was then selected by the 
>>>>> President; serve as the current chairman. It therefore becomes personal 
>>>>> honor and privilege to me to be able to present this first award to good 
>>>>> friend. Dr. Ivy has had long and brilliant career in the field of plastic 
>>>>> surgery with particular interest in the cleft lip and palate patient. It 
>>>>> will be possible for us to mention only very few of Dr. Ivy’s many 
>>>>> accomplishments in our allotted time here today. would, therefore, like 
>>>>> to 
>>>>> recommend to you two publications which will give you more insight into 
>>>>> the 
>>>>> life of our honored guest.
>>>>>
>>>>
>>>> I'm hoping this sample and description are also representative of other 
>>>> dropped characters, such as single numerals in pagination and single 
>>>> initials in some instances.
>>>>
>>>> Unfortunately, I don't have a lot of time to devote to this project, so 
>>>> is there anything easy and obvious that I'm missing?
>>>>
>>>> Thanks,
>>>>
>>>> - Clinton Graham
>>>>
>>>> Systems Developer
>>>>
>>>> University of Pittsburgh | University Library System
>>>>
>>>> 412-383-1057
>>>>
>>>>
>>>> -- 
>>>> You received this message because you are subscribed to the Google 
>>>> Groups "tesseract-ocr" group.
>>>> To unsubscribe from this group and stop receiving emails from it, send 
>>>> an email to [email protected].
>>>> To post to this group, send email to [email protected].
>>>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>>>> To view this discussion on the web visit 
>>>> https://groups.google.com/d/msgid/tesseract-ocr/e0b62d2b-2e27-4732-b4fe-8d5b78c52d98%40googlegroups.com.
>>>> For more options, visit https://groups.google.com/d/optout.
>>>>
>>>

