Re: [tesseract-ocr] Re: Training from Scratch

Des Bw Fri, 24 Nov 2023 03:12:40 -0800

@zdenop: 
Yes, because the characters start to show up (get recognized) only after 
you run a few thousand iterations. For me, new characters start to get 
recognized only after I run 5000 iterations. By that point, the base model 
has deteriorated terribly. It is now common knowledge that fine-tuning 
for more than 400 iterations badly compromises the base model. 
For that reason, fine-tuning is not an effective way to add new characters 
(even if the guide says it is possible).
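
For reference, the fine-tuning route described in the guide comes down to an 
lstmtraining call roughly like the one sketched below. This is only an 
illustration: every path is a hypothetical placeholder, the helper function 
is mine, and the iteration count is just a parameter (400 is the ceiling I 
mentioned above, not a value from the guide). The script only builds and 
prints the command; it does not run any training.

```python
import shlex


def fine_tune_command(base_lstm, new_traineddata, old_traineddata,
                      list_file, model_output, max_iterations):
    """Build an lstmtraining command that continues from an existing model
    while switching to a starter traineddata with an extended unicharset."""
    return [
        "lstmtraining",
        "--model_output", model_output,
        "--continue_from", base_lstm,            # .lstm extracted from the base model
        "--traineddata", new_traineddata,        # starter traineddata incl. the new character
        "--old_traineddata", old_traineddata,    # original model, used to map old characters
        "--train_listfile", list_file,           # list of .lstmf training files
        "--max_iterations", str(max_iterations),
    ]


# All paths below are hypothetical placeholders.
cmd = fine_tune_command(
    base_lstm="work/eng.lstm",
    new_traineddata="work/eng_plus/eng.traineddata",
    old_traineddata="tessdata_best/eng.traineddata",
    list_file="work/eng.training_files.txt",
    model_output="work/output/angle",
    max_iterations=400,
)
print(shlex.join(cmd))
```

Raising max_iterations is the only change needed to reproduce the trade-off 
I describe: more iterations for the new character, at the cost of the base 
model.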


Dear Zdenop, I would love to know if there is a way around it. I have been 
struggling with Tesseract for months now because the default model is missing 
one important character.  
On Thursday, November 23, 2023 at 8:59:01 PM UTC+3 zdenop wrote:

>
> On Thu, 23 Nov 2023 at 10:28, Des Bw <desal...@gmail.com> wrote:
>
>> If the original model lacks the ∠ symbol, fine tuning is not going to 
>> add it for you.
>
>
> Really??? 
> Tesseract documentation 
> <https://github.com/tesseract-ocr/tessdoc/blob/2f4d1e47094acbe3e046144573c928d740595f55/tess4/TrainingTesseract-4.00.md#fine-tuning-for-impact>:
>  
> Fine tuning is the process of training an existing model on new data 
> without changing any part of the network, although you *can* now add 
> characters to the character set. (See Fine Tuning for ± a few characters 
> <https://github.com/tesseract-ocr/tessdoc/blob/2f4d1e47094acbe3e046144573c928d740595f55/tess4/TrainingTesseract-4.00.md#fine-tuning-for--a-few-characters>
> ).
>
>  
>
>> We have all gone through that process. To introduce a new character, 
>> removing the top layer and training from there is the most 
>> effective approach.  
>>
>> On Thursday, November 23, 2023 at 12:15:56 PM UTC+3 smon...@gmail.com 
>> wrote:
>>
>>> If I need to train new characters that are not recognized by a default 
>>> model, is fine tuning the right approach in this case?
>>> One of these characters is the one for angularity:  ∠
>>>
>>> These symbols appear in technical drawings and should be recognized in 
>>> those. E.g. in the scenario in the following picture, tesseract should 
>>> recognize this symbol. 
>>>
>>>
>>>
>>> [image: angularity.png]
>>>
>>> Also here is one of the pngs I tried to train with: 
>>> [image: angularity_0_r0.jpg] 
>>> They all look pretty similar to this one. The things that change are the 
>>> angle, the proportions and the thickness of the lines. All examples have 
>>> this 64x64 pixel box around them. 
>>>
>>>
>>> Is fine tuning the right approach for this scenario? I can only find 
>>> information on fine tuning for specific fonts. Also, for fine tuning the 
>>> "tesstrain" repository would not be needed, as it is used for training from 
>>> scratch, correct?
>>> desal...@gmail.com wrote on Wednesday, November 22, 2023 at 15:27:02 
>>> UTC+1:
>>>
>>>> From my limited experience, you need a lot more data than that to train 
>>>> from scratch. If you can't produce more data than that, you might first 
>>>> try to fine-tune, and then train by removing the top layer of the best 
>>>> model. 
>>>>
>>>> On Wednesday, November 22, 2023 at 4:46:53 PM UTC+3 smon...@gmail.com 
>>>> wrote:
>>>>
>>>>> As it is not really possible to combine my from-scratch traineddata 
>>>>> with an existing one, I have decided to also train my traineddata model 
>>>>> on numbers. To do that, I wrote a script that synthetically generates 
>>>>> ground-truth data with text2image. 
>>>>> This script uses dozens of different fonts and creates numbers for the 
>>>>> following formats. 
>>>>> X.XXX
>>>>> X.XX
>>>>> X,XX
>>>>> X,XXX
>>>>> I generated 10,000 files to train the numbers. Unfortunately, 
>>>>> numbers are recognized pretty poorly with the best model (most of the 
>>>>> time only "0.", "0" or "0," gets recognized).  
>>>>> So I wanted to ask whether this is simply not enough training 
>>>>> (ground-truth) data for proper recognition when I train on several fonts. 
>>>>> Thanks in advance for your help. 
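
A minimal sketch of how the text side of such ground truth could be 
generated for the four number formats listed above. The function names, the 
fixed seed and the round-robin over formats are assumptions made for 
illustration, not the original script. Rendering each line to an image with 
text2image and a chosen font is a separate step that is not shown here.

```python
import random

# The four formats listed in the message; "X" stands for one digit.
FORMATS = ["X.XXX", "X.XX", "X,XX", "X,XXX"]


def random_number(fmt, rng):
    """Replace every 'X' in fmt with a random digit, keep separators as-is."""
    return "".join(str(rng.randint(0, 9)) if ch == "X" else ch for ch in fmt)


def make_lines(count, seed=0):
    """Return `count` ground-truth strings, cycling through the formats so
    each format gets an (almost) equal share of the samples."""
    rng = random.Random(seed)
    return [random_number(FORMATS[i % len(FORMATS)], rng) for i in range(count)]


lines = make_lines(8)
for line in lines:
    print(line)
```

Drawing every digit uniformly keeps all ten digits about equally represented 
across the generated lines.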
>>>>>
>>>> -- 
>> You received this message because you are subscribed to the Google Groups 
>> "tesseract-ocr" group.
>> To unsubscribe from this group and stop receiving emails from it, send an 
>> email to tesseract-oc...@googlegroups.com.
>> To view this discussion on the web visit 
>> https://groups.google.com/d/msgid/tesseract-ocr/fb4a1b27-db44-49a6-adfa-ada9e13030aan%40googlegroups.com
>>  
>> <https://groups.google.com/d/msgid/tesseract-ocr/fb4a1b27-db44-49a6-adfa-ada9e13030aan%40googlegroups.com?utm_medium=email&utm_source=footer>
>> .
>>
>

