[tesseract-ocr] Re: Training from Scratch

Simon Thu, 23 Nov 2023 01:35:17 -0800

Thanks a lot!
This is not possible with the tesstrain repository, right?

On Thursday, November 23, 2023 at 10:28:26 AM UTC+1 [email protected] 
wrote:


> If the original model lacks the ∠ symbol, fine tuning is not going to add 
> it for you. We have all gone through that process. To introduce a new 
> character, removing the top layer and training from there is the most 
> effective approach.
>
> On Thursday, November 23, 2023 at 12:15:56 PM UTC+3 [email protected] 
> wrote:
>
>> If I need to train new characters that are not recognized by a default 
>> model, is fine tuning the right approach in this case?
>> One of these characters is the one for angularity: ∠
>>
>> These symbols appear in technical drawings and should be recognized in 
>> those. E.g. in the scenario in the following picture, tesseract should 
>> recognize this symbol.
>>
>>
>>
>> [image: angularity.png]
>>
>> Also, here is one of the images I tried to train with: 
>> [image: angularity_0_r0.jpg] 
>> They all look pretty similar to this one. Things that change are the 
>> angle, the proportion and the thickness of the lines. All examples have this 
>> 64x64 pixel box around them.
>>
>>
>> Is fine tuning the right approach for this scenario? I only find 
>> information on fine tuning for specific fonts. Also, for fine tuning the 
>> "tesstrain" repository would not be needed, as it is used for training from 
>> scratch, correct?
>> On Wednesday, November 22, 2023 at 3:27:02 PM UTC+1 [email protected] 
>> wrote:
>>
>>> From my limited experience, you need a lot more data than that to train 
>>> from scratch. If you can't make more data than that, you might first try to 
>>> fine tune, and then train by removing the top layer of the best model.
>>>
>>> On Wednesday, November 22, 2023 at 4:46:53 PM UTC+3 [email protected] 
>>> wrote:
>>>
>>>> As it is not properly possible to combine my traineddata trained from 
>>>> scratch with an existing one, I have decided to also train my traineddata 
>>>> model on numbers. Therefore I wrote a script which synthetically generates 
>>>> ground truth data with text2image. 
>>>> This script uses dozens of different fonts and creates numbers in the 
>>>> following formats:
>>>> X.XXX
>>>> X.XX
>>>> X,XX
>>>> X,XXX
>>>> I generated 10,000 files to train the numbers. But unfortunately 
>>>> numbers get recognized pretty poorly with the best model (most of the time 
>>>> only "0.", "0" or "0," gets recognized). 
>>>> So I wanted to ask whether this is not enough training (ground truth) data 
>>>> for proper recognition when I train several fonts. 
>>>> Thanks in advance for your help.
>>>>
>>>
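
A rough sketch of the "remove the top layer and train from there" suggestion quoted above, assuming the lstmtraining flags described in the Tesseract training documentation (--continue_from, --append_index, --net_spec). All paths, the layer index 5, the iteration count and the net spec string are placeholders in the style of the documentation examples, not values from this thread. The helper name is hypothetical; it only builds the argument list and does not run anything.

```python
from typing import List


def build_replace_top_layer_cmd(
    base_lstm: str,
    new_traineddata: str,
    model_output: str,
    train_listfile: str,
    append_index: int,
    net_spec: str,
    max_iterations: int = 3000,
) -> List[str]:
    """Build an lstmtraining argument list that continues from an existing
    model, cuts the network at append_index and appends the layers in net_spec."""
    return [
        "lstmtraining",
        "--continue_from", base_lstm,          # .lstm extracted from the base model
        "--traineddata", new_traineddata,      # starter traineddata with the new unicharset
        "--append_index", str(append_index),   # layers above this index are replaced
        "--net_spec", net_spec,                # spec of the replacement top layers
        "--model_output", model_output,        # checkpoint prefix
        "--train_listfile", train_listfile,    # list of .lstmf training files
        "--max_iterations", str(max_iterations),
    ]


# Placeholder values only; adjust to the actual model and data layout.
cmd = build_replace_top_layer_cmd(
    base_lstm="eng.lstm",
    new_traineddata="data/eng_angle/eng_angle.traineddata",
    model_output="data/checkpoints/eng_angle",
    train_listfile="data/list.train",
    append_index=5,
    net_spec="[Lfx256 O1c111]",
)
print(" ".join(cmd))
```

The list can be passed to subprocess.run once the paths exist; whether the chosen index and net spec suit a given base model has to be checked against that model's network structure.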

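For the synthetic number ground truth described at the start of the thread, a minimal sketch of generating strings in the four listed formats (X.XXX, X.XX, X,XX, X,XXX). The function names and the fixed seed are illustrative assumptions; the original script is not shown in the thread, and rendering the strings with text2image is left out.

```python
import random
from typing import List

# The four formats listed in the thread; each X stands for one digit.
FORMATS = ["X.XXX", "X.XX", "X,XX", "X,XXX"]


def random_number(fmt: str, rng: random.Random) -> str:
    """Replace every X in fmt with a random digit, keeping the separators."""
    return "".join(str(rng.randint(0, 9)) if ch == "X" else ch for ch in fmt)


def make_ground_truth(count: int, seed: int = 0) -> List[str]:
    """Return count strings, cycling through the formats so each is equally common."""
    rng = random.Random(seed)
    return [random_number(FORMATS[i % len(FORMATS)], rng) for i in range(count)]


lines = make_ground_truth(8)
for line in lines:
    print(line)
```
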
-- 
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/23835b33-025a-48ad-9037-3eef237393cfn%40googlegroups.com.
