Re: [tesseract-ocr] Re: Does unicharset affect recognition quality ?

Yury Sun, 10 Sep 2017 22:09:13 -0700

Thanks for your hint. 

I installed Cygwin and compiled Tesseract 4.0 under it. Quality has 
improved significantly. 
However, another problem appeared. 
In OEM mode 1 or 3 everything works fine. When I choose mode 0 or 2, I 
get this error:


Failed loading language 'kan'
Tesseract couldn't load any languages!
Could not initialize tesseract.

I set TESSDATA_PREFIX to "/usr/share/tessdata". It contains the eng, kan, Kannada 
and osr traineddata files obtained from the best directory. 
What could be the problem? Do these modes not work in version 4?

tesseract 4.00.00alpha
 leptonica-1.74.4
  libgif 4.1.6(?) : libjpeg 8d (libjpeg-turbo 1.5.0) : libpng 1.6.30 : 
libtiff 4.0.7 : zlib 1.2.11 : libwebp 0.4.4 : libopenjp2 2.1.2

 Found AVX
 Found SSE
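
For anyone reproducing the report above, the four engine-mode runs can be scripted. This is a minimal sketch, assuming tesseract is on PATH; the helper names and the input image name are hypothetical, and the TESSDATA_PREFIX value is the one quoted in this message. A likely explanation, not confirmed in this thread, is that the files in the best directory hold LSTM models only, so the legacy engine requested by modes 0 and 2 has nothing to load.

```python
import os
import subprocess

# Engine modes as described by `tesseract --help-oem` in 4.0.
OEM_MODES = {
    0: "legacy engine only",
    1: "LSTM engine only",
    2: "legacy + LSTM engines",
    3: "default, based on what is available",
}


def build_command(image, out_base, lang, oem):
    """Return the argv list for one tesseract run with the given engine mode."""
    if oem not in OEM_MODES:
        raise ValueError(f"unknown oem mode: {oem}")
    return ["tesseract", image, out_base, "-l", lang, "--oem", str(oem)]


def run_all_modes(image, lang="kan", tessdata_prefix="/usr/share/tessdata"):
    """Run every mode once and record whether tesseract exited successfully."""
    env = dict(os.environ, TESSDATA_PREFIX=tessdata_prefix)
    results = {}
    for oem in OEM_MODES:
        cmd = build_command(image, f"out_oem{oem}", lang, oem)
        proc = subprocess.run(cmd, env=env, capture_output=True, text=True)
        results[oem] = proc.returncode == 0
    return results
```

Calling run_all_modes("page.png") (a hypothetical image) returns a dict mapping each mode to True or False, which makes it easy to see which engine modes the installed traineddata supports.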

On Saturday, 26 August 2017 at 0:23:49 UTC+7, shree wrote:
>
> I do not know about internal working of tesseract.
>
> If you unpack best/kan.traineddata, you may find a smaller unicharset 
> with just the basic characters in it.
>
> Tesseract 4 uses the LSTM neural net engine, whereas 3.05 uses the legacy engine. 
> LSTM does line-based recognition rather than character-based recognition.
>
> Yes, it is possible to have both versions installed, however I do not have 
> exact instructions to make it work. It would also depend on what o/s you 
> are using.
>
> I only have the latest GitHub version installed.
>
> On 25-Aug-2017 9:46 PM, "Yury" <[email protected]> wrote:
>
>> ShreeDevi,
>>
>> Thanks for your answers and taking the time.
>>
>> I got the traineddata file for version 3.04 (the file is a little smaller, but the number 
>> of characters is the same, 2851) and got the same result: some symbols are 
>> split into a pair (the first is correct and the second is wrong).
>> I am thinking of upgrading to 4.00, so I have some questions: 
>>
>> Can I put the new version alongside 3.05, without installing it?
>>
>> And another question from my first post:
>> Does Tesseract have a limit on the number of bytes per character 
>> in Unicode?
>> Are there any additional parameters to remove the limit on the number 
>> of bytes per symbol?
>>
>> On Friday, 25 August 2017 at 20:13:22 UTC+7, shree wrote:
>>>
>>> If you are using 4.0alpha, the latest version of the program, you can use 
>>> the Kannada traineddata from 
>>>
>>>
>>> https://github.com/tesseract-ocr/tessdata/blob/master/best/kan.traineddata
>>> or
>>>
>>> https://github.com/tesseract-ocr/tessdata/blob/master/best/Kannada.traineddata
>>>
>>> I have not tested Kannada personally, but if it follows the pattern for 
>>> Devanagari, it should be better than the older traineddata.
>>>
>>> If you are using version 3.05 of the program,
>>> then use the traineddata files from 
>>> https://github.com/tesseract-ocr/tessdata/releases/tag/3.04.00
>>>
>>> Please note that the unicharset and langdata files are used while 
>>> training and just changing the unicharset file is NOT going to improve the 
>>> recognition.
>>>
>>> For that training needs to be done. Please see the wiki for more details.
>>>
>>> ShreeDevi
>>> ____________________________________________________________
>>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>>
>>> On Fri, Aug 25, 2017 at 6:31 PM, Yury <[email protected]> wrote:
>>>
>>>> Hello shree!
>>>>
>>>> Thanks for your links and taking the time.
>>>>
>>>> I did not find a /best/ folder in the ~alex-p profile.
>>>> But I found kan.traineddata in the package tesseract-lang-4.00 (in 
>>>> tesseract-lang-3.05 the Kannada language is absent).
>>>> I took this file and ran recognition: the result is the same.
>>>> This package is dated 08.01.17 and has 2851 characters (as mine does).
>>>> So I think I was already using this package earlier.
>>>>
>>>> On Friday, 25 August 2017 at 18:56:25 UTC+7, shree wrote:
>>>>>
>>>>> https://launchpad.net/~alex-p/+archive/ubuntu/tesseract-ocr
>>>>>
>>>>> For ppa
>>>>>
>>>>> On 25-Aug-2017 5:22 PM, "ShreeDevi Kumar" <[email protected]> wrote:
>>>>>
>>>>>> The latest GitHub source in the master branch is for 4.0alpha. You can 
>>>>>> install it via PPA.
>>>>>>
>>>>>> Search for tesseract PPA Alex in Google.
>>>>>>
>>>>>> _sent from phone
>>>>>>
>>>>>> On 25-Aug-2017 4:42 PM, "Yury" <[email protected]> wrote:
>>>>>>
>>>>>>> Hello again.
>>>>>>>
>>>>>>> I found this: 
>>>>>>> https://github.com/tesseract-ocr/tessdata/blob/master/best/Kannada.traineddata
>>>>>>>
>>>>>>> But after recognition I see only English letters and digits, so 
>>>>>>> this did not work.
>>>>>>> In the log I see:
>>>>>>>  theraysmith <https://github.com/theraysmith> Added best 
>>>>>>> traineddatas for 4.00 alpha 
>>>>>>> <https://github.com/tesseract-ocr/tessdata/commit/3a94ddd47be01fd897cbe31f05cbd2301454cf8a>
>>>>>>>
>>>>>>> I have 3.05.
>>>>>>>
>>>>>>>
>>>>>>> On Friday, 25 August 2017 at 17:47:56 UTC+7, Yury wrote:
>>>>>>>>
>>>>>>>> Hello, shree!
>>>>>>>>
>>>>>>>> Can you tell me the exact path for tessdata/best/*.traineddata?
>>>>>>>>
>>>>>>>> On Friday, 25 August 2017 at 16:07:49 UTC+7, shree wrote:
>>>>>>>>>
>>>>>>>>> Have you tried the new tessdata/best/*.traineddata with the latest 
>>>>>>>>> github sources?
>>>>>>>>>
>>>>>>>> -- 
>>>>>>> You received this message because you are subscribed to the Google 
>>>>>>> Groups "tesseract-ocr" group.
>>>>>>> To unsubscribe from this group and stop receiving emails from it, 
>>>>>>> send an email to [email protected].
>>>>>>> To post to this group, send email to [email protected].
>>>>>>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>>>>>>> To view this discussion on the web visit 
>>>>>>> https://groups.google.com/d/msgid/tesseract-ocr/b20f906b-db90-43f1-b9c6-b1bb40d21414%40googlegroups.com
>>>>>>>  
>>>>>>> <https://groups.google.com/d/msgid/tesseract-ocr/b20f906b-db90-43f1-b9c6-b1bb40d21414%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>>>>> .
>>>>>>> For more options, visit https://groups.google.com/d/optout.
>>>>>>>
>>>>
>>>
>>
>
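
The character count compared in the quoted messages (2851 entries) can be checked directly once a unicharset has been unpacked from a traineddata file, for example with combine_tessdata -u. This is a minimal sketch, assuming the usual text layout in which the first line holds the number of entries and each later line describes one entry; the function name and the file name in the usage note are hypothetical.

```python
def unicharset_counts(path):
    """Return (declared, actual) entry counts for a text unicharset file.

    declared: the integer on the first line of the file.
    actual:   the number of non-empty entry lines that follow it.
    """
    with open(path, encoding="utf-8") as fh:
        lines = fh.read().splitlines()
    declared = int(lines[0].split()[0])
    actual = sum(1 for line in lines[1:] if line.strip())
    return declared, actual
```

For example, unicharset_counts("kan.unicharset") returns both numbers, so two traineddata versions can be compared without opening the files by hand.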

