[tesseract-ocr] Re: Does unicharset affect recognition quality ?

2017-08-24 Thread Yury
Sorry, the Tesseract version is 3.05.01, not 5.03.


-- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to tesseract-ocr+unsubscr...@googlegroups.com.
To post to this group, send email to tesseract-ocr@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/26fe216a-ed06-4e32-84aa-436ac830101a%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


Re: [tesseract-ocr] Dropped single character words

2017-08-24 Thread ShreeDevi Kumar
There is an unofficial PPA package available with the latest code, if you do
not want to build it yourself.

-- Excuse the brevity, msg sent from phone.




Re: [tesseract-ocr] Dropped single character words

2017-08-24 Thread ShreeDevi Kumar
You can try building the latest GitHub source for 4.0alpha and testing with
best/eng.traineddata from the tessdata repository.
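
For reference, a run against such a build might be assembled from Python
roughly as below. This is only a sketch under assumptions: --tessdata-dir and
--psm are the 4.x spellings of the options, while the directory
"tessdata/best", the file names, and the helper names build_command and
run_ocr are hypothetical placeholders, not anything given in this thread.

```python
import subprocess


def build_command(image, outputbase, tessdata_dir, lang="eng", psm=1):
    """Return the argument list for one Tesseract 4.x run that writes a PDF."""
    return [
        "tesseract", image, outputbase,
        "--tessdata-dir", tessdata_dir,  # directory holding <lang>.traineddata
        "-l", lang,
        "--psm", str(psm),               # 4.x spelling of the 3.x "-psm" option
        "pdf",                           # config name that selects PDF output
    ]


def run_ocr(image, outputbase, tessdata_dir):
    """Run Tesseract; raises CalledProcessError if it exits with an error."""
    subprocess.run(build_command(image, outputbase, tessdata_dir), check=True)


if __name__ == "__main__":
    # Hypothetical paths, shown only to illustrate the resulting command line.
    print(" ".join(build_command("page.tif", "page", "tessdata/best")))
```

Keeping the command construction separate from the call makes it easy to
inspect the exact command line before running it over a batch of files.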

-- Excuse the brevity, msg sent from phone.

On 25-Aug-2017 12:36 AM, "Clinton Graham" wrote:

> Do you have any simple suggestions for improving OCR quality where
> tesseract is missing single character words like "a" and "I"?
>
> I'm using the default packages available in Ubuntu:
> tesseract 3.03
>  leptonica-1.70
>   libgif 4.1.6(?) : libjpeg 8d : libpng 1.2.50 : libtiff 4.0.3 : zlib 1.2.8 : webp 0.4.0
>
> I've also tried updating Ubuntu, building later 3.x sources:
> tesseract 3.05.01
>  leptonica-1.74.4
>   libjpeg 8d (libjpeg-turbo 1.3.0) : libpng 1.2.50 : libtiff 4.0.3 : zlib 1.2.8
>
> I'm using a command line run of simply:
> tesseract -psm 1 -l eng $f $f pdf
>
> I've also tried -psm 6 based on another forum post (though some of my
> input will be multicolumn).
>
> In any case, the first paragraph of my TIFF (attached) is
> consistently read without the single-character words:
>
>> Honors Award {Presentation to Robert H. Ivy, M.D., D.D.S., Sc.D.,
>> F_‘.A.C.S. At the business meeting .of the American Cleft Palate
>> Association on May 6, 1961 in Montreal, Canada, an Honors and Awards
>> Committee was established and its duties were set forth. The Executive
>> Committee then selected Dr. Robert Ivy to be the first recipient of an
>> Honors Award. An HOnors and Awards Committee was then selected by the
>> President; serve as the current chairman. It therefore becomes personal
>> honor and privilege to me to be able to present this first award to good
>> friend. Dr. Ivy has had long and brilliant career in the field of plastic
>> surgery with particular interest in the cleft lip and palate patient. It
>> will be possible for us to mention only very few of Dr. Ivy’s many
>> accomplishments in our allotted time here today. would, therefore, like to
>> recommend to you two publications which will give you more insight into the
>> life of our honored guest.
>>
>
> I'm hoping this sample and description are also representative of other
> dropped characters, such as single numerals in pagination and single
> initials in some instances.
>
> Unfortunately, I don't have a lot of time to devote to this project, so
> is there anything easy and obvious that I'm missing?
>
> Thanks,
>
> - Clinton Graham
>
> Systems Developer
>
> University of Pittsburgh | University Library System
>
> 412-383-1057

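When comparing versions or page segmentation modes for this symptom, a rough
check of the text output can help. The sketch below simply counts standalone
"a", "A" and "I" tokens; the helper name count_single_letter_words is
hypothetical, and the excerpt is taken from the OCR sample quoted above.

```python
import string

SINGLE_LETTER_WORDS = {"a", "A", "I"}


def count_single_letter_words(text):
    """Count whitespace-separated tokens that are exactly "a", "A" or "I"
    once leading and trailing ASCII punctuation has been stripped."""
    count = 0
    for token in text.split():
        if token.strip(string.punctuation) in SINGLE_LETTER_WORDS:
            count += 1
    return count


if __name__ == "__main__":
    ocr_excerpt = ("President; serve as the current chairman. "
                   "It therefore becomes personal honor")
    print(count_single_letter_words(ocr_excerpt))
```

A count of zero over a long English paragraph is a strong hint that
single-character words are being dropped rather than merely misread.
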


[tesseract-ocr] Does unicharset affect recognition quality ?

2017-08-24 Thread Yury
I think not.

I call Tesseract 5.03 from Python under Windows 8 to recognize Kannada
text.
Recognition quality is acceptable, at about 80%. However, some symbols are
split into two halves: one half is recognized correctly, and the other is
replaced by ಲ.
Example: ಕಾಂ (one char) is recognized as ಕಾಲ (two chars), ನಿಂ is recognized
as ನಿಲ, and so on, although the separate chars ಕಾ, ನಿ, ... are recognized
correctly.
I unpacked the .unicharset file from kan.traineddata and tried to correct
the characters' parameters.
I summed the widths of both chars in each pair, added some gap, and put the
result into min/max width (with some deviation). I also adjusted the other
min/max params, taking values from chars that are recognized well.
After that I overwrote the unicharset in the existing traineddata and saw no
difference.
I tried many values and did not see any change in recognition.
In the end I put ten zeros (0,0,0,0,...) into the parameters of the ಲ char;
the result is the same (ಲ is recognized as usual).
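
The width adjustment described above can be sketched as follows. This assumes
the layout the message implies: ten comma-separated values per character, with
the minimum and maximum width in the fifth and sixth positions. Some Tesseract
versions may store a mean and a standard deviation in those slots instead, so
the positions should be checked against the actual file. The function names
and the sample numbers are hypothetical.

```python
MIN_WIDTH, MAX_WIDTH = 4, 5  # assumed zero-based positions of the width pair


def parse_metrics(field):
    """Split one glyph-metrics field into a list of ten numbers."""
    values = [float(v) for v in field.split(",")]
    if len(values) != 10:
        raise ValueError("expected 10 values, got %d" % len(values))
    return values


def combined_width_range(first, second, gap=0, slack=0):
    """Sum the width ranges of two glyphs, add a gap, widen by slack."""
    a, b = parse_metrics(first), parse_metrics(second)
    low = a[MIN_WIDTH] + b[MIN_WIDTH] + gap - slack
    high = a[MAX_WIDTH] + b[MAX_WIDTH] + gap + slack
    return max(low, 0), high  # a width cannot be negative


if __name__ == "__main__":
    # Made-up metrics for two glyphs, for illustration only.
    left = "0,255,0,255,10,20,0,0,0,0"
    right = "0,255,0,255,30,40,0,0,0,0"
    print(combined_width_range(left, right, gap=5, slack=2))
```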

I think that in the new version of Tesseract the quality of recognition
doesn't depend on the parameters in the unicharset.

So, how can I tune Tesseract?
Are there any other ways to control it?
I don't want to retrain Tesseract from scratch because I don't have a large
text containing all the characters (my unicharset has 2851 chars).

On the other hand, I noticed that only chars whose Unicode length is 1 or 2
bytes are recognized correctly. Characters that are 3 or more bytes long
are not always recognized.
Are there any additional parameters that remove the limit on the number of
bytes per symbol?
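
One thing worth checking here is which "length" is being measured. Every code
point in the Kannada block (U+0C80 to U+0CFF) takes 3 bytes in UTF-8, so a
single letter such as ಲ is already 3 bytes long. The 1, 2 and 3 in the
observation above may therefore be counts of code points per unicharset entry
rather than bytes. That reading is an assumption, not something confirmed in
this thread. A small sketch for measuring both (describe is a hypothetical
helper):

```python
import unicodedata


def describe(text):
    """Return (number of code points, number of UTF-8 bytes) for a string."""
    return len(text), len(text.encode("utf-8"))


if __name__ == "__main__":
    samples = {
        "LA": "\u0cb2",
        "KA + AA": "\u0c95\u0cbe",
        "KA + AA + ANUSVARA": "\u0c95\u0cbe\u0c82",
    }
    for label, text in samples.items():
        points, size = describe(text)
        names = ", ".join(unicodedata.name(ch) for ch in text)
        print(label, points, size, names)
```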
