Re: [tesseract-ocr] Dropped single character words

ShreeDevi Kumar Fri, 25 Aug 2017 05:50:36 -0700

https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality


Rescaling to 300 dpi is also helpful.
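
To make the rescaling suggestion a bit more concrete, here is a minimal Python sketch of the arithmetic involved. It is not taken from the wiki page: the function name rescale_size and the 150 dpi example scan are assumptions for illustration only. It works out the pixel dimensions a scan would need to be resampled to so that it is effectively 300 dpi; the resampling itself would be done with whatever image tool you already use.

```python
def rescale_size(width_px, height_px, source_dpi, target_dpi=300):
    """Return (new_width, new_height) in pixels for a scan resampled
    from source_dpi to target_dpi.

    rescale_size is a hypothetical helper, not part of tesseract
    or leptonica.
    """
    if source_dpi <= 0 or target_dpi <= 0:
        raise ValueError("dpi values must be positive")
    # Physical page size stays the same, so pixel counts scale linearly
    # with the ratio of the two resolutions.
    factor = target_dpi / source_dpi
    return round(width_px * factor), round(height_px * factor)


if __name__ == "__main__":
    # Assumed example: a letter-size page (8.5 x 11 in) scanned at 150 dpi.
    print(rescale_size(1275, 1650, source_dpi=150))
```

An image that is already at 300 dpi comes back with its dimensions unchanged, so the helper can be applied to a mixed batch of scans without special-casing.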

ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Fri, Aug 25, 2017 at 5:44 PM, Clinton Graham <[email protected]> wrote:

> Thanks for the suggestion.  The 4.0 alpha does seem to be providing better
> results out of the box.  I pulled the Windows installer:
> tesseract 4.00.00alpha
>  leptonica-1.74.1
>   libgif 4.1.6(?) : libjpeg 8d (libjpeg-turbo 1.5.0) : libpng 1.6.20 :
> libtiff 4.0.6 : zlib 1.2.8 : libwebp 0.4.3 : libopenjp2 2.1.0
>
> Enjoy,
>
>
> - Clinton Graham
> Systems Developer
> University of Pittsburgh | University Library System
> 412-383-1057
>
> On Friday, August 25, 2017 at 7:54:25 AM UTC-4, shree wrote:
>>
>> https://launchpad.net/~alex-p/+archive/ubuntu/tesseract-ocr
>>
>> For the ppa
>>
>> On 25-Aug-2017 12:45 AM, "ShreeDevi Kumar" <[email protected]> wrote:
>>
>>> There is an unofficial ppa package available with latest code, if you do
>>> not want to build it.
>>>
>>> -- Excuse the brevity, msg sent from phone.
>>>
>>> On 25-Aug-2017 12:41 AM, "ShreeDevi Kumar" <[email protected]> wrote:
>>>
>>>> You can try building latest GitHub source for 4.0alpha and test with
>>>> the best/eng.traineddata from the tessdata repository.
>>>>
>>>> -- Excuse the brevity, msg sent from phone.
>>>>
>>>> On 25-Aug-2017 12:36 AM, "Clinton Graham" <[email protected]> wrote:
>>>>
>>>>> Do you have any simple suggestions for improving OCR quality where
>>>>> tesseract is missing single character words like "a" and "I"?
>>>>>
>>>>> I'm using the default packages available in Ubuntu:
>>>>> tesseract 3.03
>>>>>  leptonica-1.70
>>>>>   libgif 4.1.6(?) : libjpeg 8d : libpng 1.2.50 : libtiff 4.0.3 : zlib
>>>>> 1.2.8 : webp 0.4.0
>>>>>
>>>>> I've also tried updating Ubuntu, building later 3.x sources:
>>>>> tesseract 3.05.01
>>>>>  leptonica-1.74.4
>>>>>   libjpeg 8d (libjpeg-turbo 1.3.0) : libpng 1.2.50 : libtiff 4.0.3 :
>>>>> zlib 1.2.8
>>>>>
>>>>> I'm using a command line run of simply:
>>>>> tesseract -psm 1 -l eng $f $f pdf
>>>>>
>>>>> I've also tried -psm 6 based on another forum post (though some of my
>>>>> input will be multicolumn).
>>>>>
>>>>> In either case, the first paragraph of my TIFF (attached) is
>>>>> consistently read without instances of single character words:
>>>>>
>>>>>> Honors Award {Presentation to Robert H. Ivy, M.D., D.D.S., Sc.D.,
>>>>>> F_‘.A.C.S. At the business meeting .of the American Cleft Palate
>>>>>> Association on May 6, 1961 in Montreal, Canada, an Honors and Awards
>>>>>> Committee was established and its duties were set forth. The Executive
>>>>>> Committee then selected Dr. Robert Ivy to be the first recipient of an
>>>>>> Honors Award. An HOnors and Awards Committee was then selected by the
>>>>>> President; serve as the current chairman. It therefore becomes personal
>>>>>> honor and privilege to me to be able to present this first award to good
>>>>>> friend. Dr. Ivy has had long and brilliant career in the field of plastic
>>>>>> surgery with particular interest in the cleft lip and palate patient. It
>>>>>> will be possible for us to mention only very few of Dr. Ivy’s many
>>>>>> accomplishments in our allotted time here today. would, therefore, like to
>>>>>> recommend to you two publications which will give you more insight into the
>>>>>> life of our honored guest.
>>>>>>
>>>>>
>>>>> I'm hoping this sample and description are also representative of other
>>>>> dropped characters, such as single numerals in pagination and single
>>>>> initials in some instances.
>>>>>
>>>>> Unfortunately, I don't have a lot of time to devote to this project,
>>>>> so is there anything easy and obvious that I'm missing?
>>>>>
>>>>> Thanks,
>>>>>
>>>>> - Clinton Graham
>>>>>
>>>>> Systems Developer
>>>>>
>>>>> University of Pittsburgh | University Library System
>>>>>
>>>>> 412-383-1057
>>>>>
>>>>>
>>>>> --
>>>>> You received this message because you are subscribed to the Google
>>>>> Groups "tesseract-ocr" group.
>>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>>> an email to [email protected].
>>>>> To post to this group, send email to [email protected].
>>>>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>>>>> To view this discussion on the web visit
>>>>> https://groups.google.com/d/msgid/tesseract-ocr/e0b62d2b-2e27-4732-b4fe-8d5b78c52d98%40googlegroups.com.
>>>>> For more options, visit https://groups.google.com/d/optout.
>>>>>
