[tesseract-ocr] Re: Training from Scratch

Simon Thu, 23 Nov 2023 01:16:03 -0800

If I need to train new characters that are not recognized by a default 
model, is fine tuning the right approach in this case?
One of these characters is the angularity symbol: ∠

These symbols appear in technical drawings and should be recognized 
there. E.g., in the scenario shown in the following picture, Tesseract 
should recognize this symbol.

[image: angularity.png]

Also, here is one of the images I tried to train with: 
[image: angularity_0_r0.jpg] 
They all look pretty similar to this one. The things that change are the 
angle, the proportion and the thickness of the lines. All examples have 
this 64x64 pixel box around them.
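
As an illustration of the variation described above, here is a minimal sketch that draws a ∠-like shape into a 64x64 grid with a variable angle, leg proportion and line thickness. It is not the script used to make the attached images. The function names, the vertex position and the value ranges are assumptions chosen for the example, and the grid is a plain list of lists rather than an image file.

```python
import math
import random

SIZE = 64  # box size in pixels, as mentioned in the message


def draw_segment(grid, x0, y0, x1, y1, thickness):
    """Set every pixel that lies within thickness/2 of the segment."""
    dx, dy = x1 - x0, y1 - y0
    length_sq = dx * dx + dy * dy
    for y in range(SIZE):
        for x in range(SIZE):
            # Project the pixel onto the segment and clamp to its end points.
            t = ((x - x0) * dx + (y - y0) * dy) / length_sq
            t = max(0.0, min(1.0, t))
            px, py = x0 + t * dx, y0 + t * dy
            if math.hypot(x - px, y - py) <= thickness / 2:
                grid[y][x] = 1


def angle_symbol(angle_deg, base_len, ray_len, thickness):
    """Return a SIZE x SIZE grid (0 = background, 1 = ink) with an angle shape."""
    grid = [[0] * SIZE for _ in range(SIZE)]
    vx, vy = 8, SIZE - 12  # vertex near the lower-left corner (assumed layout)
    # Horizontal base leg.
    draw_segment(grid, vx, vy, vx + base_len, vy, thickness)
    # Inclined leg; image y grows downward, so the sine term is subtracted.
    rad = math.radians(angle_deg)
    draw_segment(grid, vx, vy,
                 vx + ray_len * math.cos(rad),
                 vy - ray_len * math.sin(rad),
                 thickness)
    return grid


def random_sample(rng):
    """Pick angle, leg lengths and thickness at random (ranges are assumptions)."""
    return angle_symbol(angle_deg=rng.uniform(25, 60),
                        base_len=rng.randint(30, 46),
                        ray_len=rng.randint(30, 46),
                        thickness=rng.uniform(1.5, 4.0))


if __name__ == "__main__":
    sample = random_sample(random.Random(0))
    for row in sample:
        print("".join("#" if v else "." for v in row))
```

Running the file prints one random sample as ASCII art, which is a quick way to check that the chosen ranges keep the shape inside the box.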

Is fine tuning the right approach for this scenario? I can only find 
information about fine tuning for specific fonts. Also, for fine tuning 
the "tesstrain" repository would not be needed, as it is used for training 
from scratch, correct?
[email protected] wrote on Wednesday, 22 November 2023 at 15:27:02 UTC+1:

> From my limited experience, you need a lot more data than that to train 
> from scratch. If you can't produce more data than that, you might first 
> try to fine tune, and then train by removing the top layer of the best model. 
>
> On Wednesday, November 22, 2023 at 4:46:53 PM UTC+3 [email protected] 
> wrote:
>
>> As it is not really possible to combine my traineddata from scratch 
>> with an existing one, I have decided to also train my traineddata model 
>> on numbers. Therefore I wrote a script which synthetically generates 
>> ground-truth data with text2image. 
>> This script uses dozens of different fonts and creates numbers in the 
>> following formats: 
>> X.XXX
>> X.XX
>> X,XX
>> X,XXX
>> I generated 10,000 files to train the numbers. Unfortunately, numbers 
>> are recognized pretty poorly with the best model (most of the time only 
>> "0.", "0" or "0," is recognized). 
>> So I wanted to ask whether this is not enough training (ground-truth) 
>> data for proper recognition when I train several fonts. 
>> Thanks in advance for your help. 
>>
>
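
For reference, the four number formats listed in the quoted message (X.XXX, X.XX, X,XX, X,XXX) could be produced along the following lines. This is only a sketch of the text-generation step, not the poster's script: it assumes that every X stands for one random digit, and the function names are made up for the example. Rendering the strings with text2image is not shown.

```python
import random
from collections import Counter

# Formats from the quoted message: X marks a digit, "." or "," is the separator.
FORMATS = ["X.XXX", "X.XX", "X,XX", "X,XXX"]


def random_number_text(fmt, rng):
    """Replace every X in fmt with a random digit and keep the separator."""
    return "".join(str(rng.randint(0, 9)) if ch == "X" else ch for ch in fmt)


def make_ground_truth(count, seed=0):
    """Return `count` strings, cycling through the four formats in order."""
    rng = random.Random(seed)
    return [random_number_text(FORMATS[i % len(FORMATS)], rng)
            for i in range(count)]


def digit_histogram(lines):
    """Count how often each digit occurs across all generated lines."""
    return Counter(ch for line in lines for ch in line if ch.isdigit())


if __name__ == "__main__":
    lines = make_ground_truth(10000)
    print(lines[:4])
    print(sorted(digit_histogram(lines).items()))
```

The histogram is a cheap sanity check that all ten digits occur about equally often in the ground truth before any images are rendered.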
