Re: [tesseract-ocr] Re: Does unicharset affect recognition quality ?

2017-08-25 Thread ShreeDevi Kumar
I do not know about the internal workings of Tesseract. If you unpack best/kan.traineddata you may find a smaller unicharset with just the basic characters in it. Tesseract 4 uses the LSTM neural net engine, whereas 3.05 uses the legacy engine. LSTM does line based recognition rather than character
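To compare unicharset sizes between two traineddata files (for example the 2851 characters mentioned elsewhere in this thread against the smaller set in best/kan.traineddata), a minimal sketch could look like the following. It assumes the unicharset has already been extracted to a text file (the combine_tessdata tool from the Tesseract training tools can unpack a traineddata file) and that its first line holds the number of entries, followed by one entry per line. The function name and the sample contents are hypothetical and heavily simplified.

```python
def count_unicharset_entries(text: str) -> tuple[int, int]:
    """Return (declared_count, actual_entries) for unicharset file contents.

    Assumes the first line is the declared number of entries and every
    following non-empty line is one entry.
    """
    lines = text.splitlines()
    declared = int(lines[0].strip())
    entries = [line for line in lines[1:] if line.strip()]
    return declared, len(entries)


# Hypothetical, simplified sample; real entries carry more property fields.
sample = "3\nNULL 0\na 3\nb 3\n"
declared, actual = count_unicharset_entries(sample)
print(f"declared={declared} actual={actual}")
```

If the two numbers differ, the file was probably edited by hand without updating the count on the first line.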

Re: [tesseract-ocr] Re: Does unicharset affect recognition quality ?

2017-08-25 Thread Yury
ShreeDevi, thanks for your answers and for taking the time. I got the traineddata file for version 3.04 (the file is a little smaller, but the number of characters is the same: 2851) and I get the same result: some symbols are split into a pair (the first is correct and the second is wrong). I am thinking of upgrading to 4.00,

Re: [tesseract-ocr] Re: Does unicharset affect recognition quality ?

2017-08-25 Thread ShreeDevi Kumar
If you are using 4.0alpha, the latest version of the program, you can use the Kannada traineddata from https://github.com/tesseract-ocr/tessdata/blob/master/best/kan.traineddata or https://github.com/tesseract-ocr/tessdata/blob/master/best/Kannada.traineddata I have not tested Kannada personally but if

Re: [tesseract-ocr] Re: Does unicharset affect recognition quality ?

2017-08-25 Thread Yury
Hello shree! Thanks for your links and for taking the time. I did not find a /best/ folder in the ~alex-p profile, but I found kan.traineddata in the package tesseract-lang-4.00 (Kannada is absent from tesseract-lang-3.05). I took this file and ran recognition; the result is the same. This

Re: [tesseract-ocr] Dropped single character words

2017-08-25 Thread ShreeDevi Kumar
https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality Rescaling to 300 dpi is also helpful. ShreeDevi भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com On Fri, Aug 25, 2017 at 5:44 PM, Clinton Graham
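As a rough illustration of the 300 dpi advice: rescaling means multiplying the pixel dimensions by target dpi divided by current dpi. The sketch below only computes the new size; the function name is hypothetical, the scan resolution is assumed to be known, and the actual resampling would be done with an image tool of your choice.

```python
def rescale_size(width_px: int, height_px: int,
                 current_dpi: float, target_dpi: float = 300) -> tuple[int, int]:
    """Return the pixel size an image needs in order to reach target_dpi."""
    scale = target_dpi / current_dpi
    return round(width_px * scale), round(height_px * scale)


# A letter-size page scanned at 100 dpi is 850 x 1100 pixels.
print(rescale_size(850, 1100, current_dpi=100))
```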

Re: [tesseract-ocr] Dropped single character words

2017-08-25 Thread Clinton Graham
Thanks for the suggestion. The 4.0 alpha does seem to be providing better results out of the box. I pulled the Windows installer: tesseract 4.00.00alpha leptonica-1.74.1 libgif 4.1.6(?) : libjpeg 8d (libjpeg-turbo 1.5.0) : libpng 1.6.20 : libtiff 4.0.6 : zlib 1.2.8 : libwebp 0.4.3 :

Re: [tesseract-ocr] Re: Does unicharset affect recognition quality ?

2017-08-25 Thread ShreeDevi Kumar
https://launchpad.net/~alex-p/+archive/ubuntu/tesseract-ocr For the PPA. On 25-Aug-2017 5:22 PM, "ShreeDevi Kumar" wrote: > Latest GitHub source in master branch is for 4.0alpha. You can install via > PPA. > > Search for tesseract PPA Alex in Google. > > _sent from phone > >

Re: [tesseract-ocr] Dropped single character words

2017-08-25 Thread ShreeDevi Kumar
https://launchpad.net/~alex-p/+archive/ubuntu/tesseract-ocr For the ppa On 25-Aug-2017 12:45 AM, "ShreeDevi Kumar" wrote: > There is an unofficial ppa package available with latest code, if you do > not want to build it. > > -- Excuse the brevity, msg sent from phone. > >

Re: [tesseract-ocr] Re: Does unicharset affect recognition quality ?

2017-08-25 Thread ShreeDevi Kumar
Latest GitHub source in master branch is for 4.0alpha. You can install via PPA. Search for tesseract PPA Alex in Google. _sent from phone On 25-Aug-2017 4:42 PM, "Yury" wrote: > Hello again. > > I found this: https://github.com/tesseract-ocr/tessdata/blob/ >

Re: [tesseract-ocr] Re: Does unicharset affect recognition quality ?

2017-08-25 Thread Yury
Hello again. I found this: https://github.com/tesseract-ocr/tessdata/blob/master/best/Kannada.traineddata But after recognition I see only English letters and digits, so this did not work. In the log I see: theraysmith Added best traineddatas for 4.00 alpha

Re: [tesseract-ocr] Re: Does unicharset affect recognition quality ?

2017-08-25 Thread Yury
Hello, shree! Can you tell me the exact path for tessdata/best/*.traineddata ? On Friday, 25 August 2017 at 16:07:49 UTC+7, user shree wrote: > > Have you tried the new tessdata/best/*.traineddata with the latest github > sources? > -- You received this message because you are subscribed

Re: [tesseract-ocr] Re: Does unicharset affect recognition quality ?

2017-08-25 Thread ShreeDevi Kumar
Have you tried the new tessdata/best/*.traineddata with the latest github sources?

[tesseract-ocr] Re: Does unicharset affect recognition quality ?

2017-08-25 Thread Yury
I can add the following. When I accidentally made a mistake in the unicharset and repacked it into the traineddata, only Latin letters and numbers were recognized (I use -l kan+eng). Thus the unicharset itself is correct, and the recognition mechanism accesses it as needed. Friday, 25
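For reference, the -l kan+eng option mentioned above joins language codes with "+". A minimal sketch of assembling such a command line follows; the helper name and the file names are hypothetical, and the optional --tessdata-dir argument is only needed if the traineddata files live outside the default location.

```python
def build_tesseract_command(image_path, output_base, languages, tessdata_dir=None):
    """Build an argument list for the tesseract command-line tool."""
    # Languages are passed as a single "+"-separated value, e.g. "kan+eng".
    cmd = ["tesseract", image_path, output_base, "-l", "+".join(languages)]
    if tessdata_dir:
        cmd += ["--tessdata-dir", tessdata_dir]
    return cmd


# Hypothetical file names; pass the list to subprocess.run() to execute it.
print(" ".join(build_tesseract_command("page.png", "out", ["kan", "eng"])))
```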