Thanks for your hint. I installed Cygwin and compiled Tesseract 4.0 under it, and recognition quality has improved significantly. However, a new problem has appeared: with OEM mode 1 or 3 everything works fine, but when I choose mode 0 or 2 I get the error:
Failed loading language 'kan'
Tesseract couldn't load any languages!
Could not initialize tesseract.

I set TESSDATA_PREFIX to "/usr/share/tessdata". It contains the eng, kan, Kannada and osd traineddata files, taken from the tessdata best directory. What could be the problem? Do these modes not work in version 4?

tesseract 4.00.00alpha
 leptonica-1.74.4
  libgif 4.1.6(?) : libjpeg 8d (libjpeg-turbo 1.5.0) : libpng 1.6.30 : libtiff 4.0.7 : zlib 1.2.11 : libwebp 0.4.4 : libopenjp2 2.1.2
 Found AVX
 Found SSE

On Saturday, 26 August 2017 at 0:23:49 UTC+7, shree wrote:
>
> I do not know about the internal workings of tesseract.
>
> If you unpack best/kan.traineddata you may find a smaller unicharset
> with just the basic characters in it.
>
> Tesseract 4 uses the LSTM neural net engine vs the legacy engine for 3.05.
> LSTM does line-based recognition rather than character-based.
>
> Yes, it is possible to have both versions installed, however I do not have
> exact instructions to make it work. It would also depend on what o/s you
> are using.
>
> I only have the latest GitHub version installed.
>
> On 25-Aug-2017 9:46 PM, "Yury" <[email protected]> wrote:
>
>> ShreeDevi,
>>
>> Thanks for your answers and for taking the time.
>>
>> I got the traineddata file for version 3.04 (the file is a little smaller,
>> but the number of characters is the same - 2851) and I get the same result:
>> some symbols are split into a pair (the first is correct and the second is wrong).
>> I am thinking of upgrading to 4.00, so I have some questions:
>>
>> Can I set up the new version alongside 3.05, without installing it?
>>
>> And another question, from my first post:
>> Does tesseract have a limit on the number of bytes per character
>> in Unicode?
>> Are there any additional parameters to remove the limit on the number
>> of bytes per symbol?
>>
>> On Friday, 25 August 2017 at 20:13:22 UTC+7, shree wrote:
>>>
>>> If you are using 4.0alpha, the latest version of the program, you can use
>>> the Kannada traineddata from
>>>
>>> https://github.com/tesseract-ocr/tessdata/blob/master/best/kan.traineddata
>>> or
>>> https://github.com/tesseract-ocr/tessdata/blob/master/best/Kannada.traineddata
>>>
>>> I have not tested Kannada personally but if it follows the pattern for
>>> Devanagari, it should be better than the older traineddata.
>>>
>>> If you are using the 3.05 version of the program,
>>> then use the traineddata files from
>>> https://github.com/tesseract-ocr/tessdata/releases/tag/3.04.00
>>>
>>> Please note that the unicharset and langdata files are used during
>>> training, and just changing the unicharset file is NOT going to improve
>>> the recognition.
>>>
>>> For that, training needs to be done. Please see the wiki for more details.
>>>
>>> ShreeDevi
>>> ____________________________________________________________
>>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>>
>>> On Fri, Aug 25, 2017 at 6:31 PM, Yury <[email protected]> wrote:
>>>
>>>> Hello shree!
>>>>
>>>> Thanks for your links and for taking the time.
>>>>
>>>> I did not find a /best/ folder in the ~alex-p profile.
>>>> But I found kan.traineddata in the package tesseract-lang-4.00 (in
>>>> tesseract-lang-3.05 the Kannada language is absent).
>>>> I took this file and ran recognition - the result is the same.
>>>> This package is dated 08.01.17 and has 2851 characters (the same as mine).
>>>> So I think I used this package earlier.
>>>>
>>>> On Friday, 25 August 2017 at 18:56:25 UTC+7, shree wrote:
>>>>>
>>>>> https://launchpad.net/~alex-p/+archive/ubuntu/tesseract-ocr
>>>>>
>>>>> For ppa
>>>>>
>>>>> On 25-Aug-2017 5:22 PM, "ShreeDevi Kumar" <[email protected]> wrote:
>>>>>
>>>>>> Latest GitHub source in master branch is for 4.0alpha. you can
>>>>>> install via post.
>>>>>>
>>>>>> Search for tesseract PPA Alex in Google.
>>>>>>
>>>>>> _sent from phone
>>>>>>
>>>>>> On 25-Aug-2017 4:42 PM, "Yury" <[email protected]> wrote:
>>>>>>
>>>>>>> Hello again.
>>>>>>>
>>>>>>> I found this:
>>>>>>> https://github.com/tesseract-ocr/tessdata/blob/master/best/Kannada.traineddata
>>>>>>>
>>>>>>> But after recognition I see only English letters and digits, so
>>>>>>> this did not work.
>>>>>>> In the log I see:
>>>>>>> theraysmith <https://github.com/theraysmith> Added best
>>>>>>> traineddatas for 4.00 alpha
>>>>>>> <https://github.com/tesseract-ocr/tessdata/commit/3a94ddd47be01fd897cbe31f05cbd2301454cf8a>
>>>>>>>
>>>>>>> I have 3.05.
>>>>>>>
>>>>>>> On Friday, 25 August 2017 at 17:47:56 UTC+7, Yury wrote:
>>>>>>>>
>>>>>>>> Hello, shree!
>>>>>>>>
>>>>>>>> Can you tell me the exact path for tessdata/best/*.traineddata?
>>>>>>>>
>>>>>>>> On Friday, 25 August 2017 at 16:07:49 UTC+7, shree wrote:
>>>>>>>>>
>>>>>>>>> Have you tried the new tessdata/best/*.traineddata with the latest
>>>>>>>>> github sources?
>>>>>>>>>
>>>>>>>> --
>>>>>>> You received this message because you are subscribed to the Google
>>>>>>> Groups "tesseract-ocr" group.
>>>>>>> To unsubscribe from this group and stop receiving emails from it,
>>>>>>> send an email to [email protected].
>>>>>>> To post to this group, send email to [email protected].
>>>>>>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>>>>>>> To view this discussion on the web visit
>>>>>>> https://groups.google.com/d/msgid/tesseract-ocr/b20f906b-db90-43f1-b9c6-b1bb40d21414%40googlegroups.com
>>>>>>> For more options, visit https://groups.google.com/d/optout.

