Re: [tesseract-ocr] Numerous different bugs while training jpn

Kamui 7 Thu, 07 Jan 2021 09:42:46 -0800

I replaced the training text with the one from the official langdata repo, 
and now it seems to produce only 30 pages. Is there any place to get the 
training text that the official jpn.traineddata was trained on? 
I have also checked that the fonts support both English and Japanese.
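
As a quick sanity check on the two points from shree's reply (how much text 
there is, and which scripts the fonts have to cover), here is a minimal 
sketch in Python. It is not part of tesstrain; the function names and the 
script buckets are my own assumptions, and the classification by Unicode 
name prefix is only approximate (fullwidth and halfwidth forms and 
punctuation all land in "OTHER").

```python
import sys
import unicodedata
from collections import Counter

# Hypothetical buckets, matched against the start of each character's Unicode name.
SCRIPT_PREFIXES = ("HIRAGANA", "KATAKANA", "CJK", "LATIN", "DIGIT")


def script_of(ch: str) -> str:
    """Roughly classify one character by the prefix of its Unicode name."""
    name = unicodedata.name(ch, "")
    for prefix in SCRIPT_PREFIXES:
        if name.startswith(prefix):
            return prefix
    return "OTHER"


def summarize_training_text(text: str) -> dict:
    """Count non-blank lines and non-space characters per script bucket."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    scripts = Counter(
        script_of(ch) for ln in lines for ch in ln if not ch.isspace()
    )
    return {"lines": len(lines), "scripts": dict(scripts)}


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python check_text.py path/to/training_text (file name is up to you)
    with open(sys.argv[1], encoding="utf-8") as fh:
        print(summarize_training_text(fh.read()))
```

A small line count would be consistent with only a few pages being rendered, 
and non-zero counts for both LATIN and the Japanese buckets mean every font 
used for rendering needs glyphs for both.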
On Thursday, January 7, 2021 at 11:28:12 AM UTC-6 shree wrote:


> Your training text file is only 175 lines, so the rendered image fits in 4 
> pages. You need to use a larger text if you want more pages.
>
> Also check that your fonts support both English and Japanese as the text 
> seems to have samples of both languages.
>
> On Thu, Jan 7, 2021, 22:40 Kamui 7 <[email protected]> wrote:
>
>> I ran a find command in the root directory and searched for the tesstrain 
>> script. It could only find the script that I pulled from the latest 
>> tesseract git repo. My training script calls that specific tesstrain script 
>> using a relative path, so it couldn't be an older version.
>>
>> On Thursday, January 7, 2021 at 11:01:55 AM UTC-6 shree wrote:
>>
>>> Old versions of tesstrain.sh used to limit training to 3 pages. Looks 
>>> like you may have an old version in the path somewhere.
>>>
>>> On Thu, Jan 7, 2021 at 10:17 PM Kamui 7 <[email protected]> wrote:
>>>
>>>> I have a script to train Tesseract and I ran it on Arch Linux, Debian, 
>>>> and even a Docker container, and they all produce the same errors. I 
>>>> checked to make sure the script is correct as well. 
>>>>
>>>> Bug 1:
>>>> This happens when tesstrain runs text2image. The max pages parameter 
>>>> does not work at all. It ends up only rendering 4 pages regardless of what 
>>>> I pass in for the maxpages parameter. I even tried hardcoding it into the 
>>>> tesstrain_utils.sh file and it still does the same thing. 
>>>>
>>>> Bug 2:
>>>> After it finishes producing those 4 pages, I fine-tune it with 
>>>> lstmtraining and the resulting output is full of "Encoding of string 
>>>> failed!" errors.
>>>>
>>>> Bug 3:
>>>> Along with those encoding errors, it also outputs the following text:
>>>>
>>>> "Image too small to scale!! (2x48 vs min width of 3)
>>>> Line cannot be recognized!!
>>>> Image not trainable"
>>>>
>>>> I will upload my script along with the Dockerfile if anyone wants to 
>>>> take a look. 
>>>>
>>>>
>>>> https://drive.google.com/file/d/1FkW1q1cXwOxY6Yi1A1cMzInbtJa9L01M/view?usp=sharing
>>>>
>>>> -- 
>>>> You received this message because you are subscribed to the Google 
>>>> Groups "tesseract-ocr" group.
>>>> To unsubscribe from this group and stop receiving emails from it, send 
>>>> an email to [email protected].
>>>> To view this discussion on the web visit 
>>>> https://groups.google.com/d/msgid/tesseract-ocr/7a9415d6-4d0c-4333-98c0-2628720661ebn%40googlegroups.com
>>>>  
>>>> <https://groups.google.com/d/msgid/tesseract-ocr/7a9415d6-4d0c-4333-98c0-2628720661ebn%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>> .
>>>>
>>>
>>>
>>> -- 
>>>
>>> ____________________________________________________________
>>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>>
>

