Re: [tesseract-ocr] Re: Training from Scratch

Simon Sat, 25 Nov 2023 03:25:05 -0800

Yes, in general I want to recognize this part, "< 0,05 A", except that the < 
is actually ∠, the character for angularity.


The segmentation process of Tesseract can't be edited, right? So you mean I 
would need to write a program, independent of Tesseract, that localizes the 
boxes, crops them out and feeds them to Tesseract? In that case I would 
still need to train Tesseract to recognize ∠. So I am still wondering 
how to train this symbol properly.
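
A minimal sketch of the localisation idea, assuming candidate rectangles have already been found by some contour or line detector (that detection step is not shown here). The `Box` type, the function names and all thresholds are hypothetical and would need tuning on real drawings.

```python
from typing import List, NamedTuple


class Box(NamedTuple):
    x: int  # left edge in pixels
    y: int  # top edge in pixels
    w: int  # width in pixels
    h: int  # height in pixels


def select_frames(boxes: List[Box],
                  min_h: int = 20, max_h: int = 80,
                  min_aspect: float = 2.0, max_aspect: float = 12.0) -> List[Box]:
    """Keep boxes whose height and width/height ratio look like a tolerance frame."""
    kept = []
    for b in boxes:
        if b.h <= 0:
            continue  # skip degenerate rectangles
        aspect = b.w / b.h
        if min_h <= b.h <= max_h and min_aspect <= aspect <= max_aspect:
            kept.append(b)
    # Sort top-to-bottom, then left-to-right, so crops come out in reading order.
    return sorted(kept, key=lambda b: (b.y, b.x))


def shrink(box: Box, margin: int) -> Box:
    """Shrink a box by `margin` pixels on every side to drop the frame lines."""
    return Box(box.x + margin, box.y + margin,
               max(box.w - 2 * margin, 0), max(box.h - 2 * margin, 0))
```

Each surviving box, shrunk by a few pixels, would then be cropped from the drawing and passed to Tesseract as a single text line.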

Since you asked whether the isolation step is able to isolate it: I can check 
this by looking at the hOCR output, right?
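
One way to inspect this, as a rough sketch: running `tesseract image.png out hocr` writes `out.hocr`, and the word boxes can be pulled out of it with a small parser. The regular expression below assumes the attribute order Tesseract normally emits (`class`, then `title`); `hocr_words` is a hypothetical helper, and `SAMPLE` is a hand-written fragment for illustration, not real Tesseract output.

```python
import re
from html import unescape
from typing import List, Tuple

# Matches one ocrx_word span and captures its bounding box and inner content.
WORD_RE = re.compile(
    r"<span[^>]*class=['\"]ocrx_word['\"][^>]*"
    r"title=['\"]bbox (\d+) (\d+) (\d+) (\d+)[^'\"]*['\"][^>]*>(.*?)</span>",
    re.DOTALL,
)
TAG_RE = re.compile(r"<[^>]+>")  # strips inline tags such as <strong> or <em>


def hocr_words(hocr: str) -> List[Tuple[str, Tuple[int, int, int, int]]]:
    """Return (text, (x0, y0, x1, y1)) for every word found in an hOCR string."""
    words = []
    for m in WORD_RE.finditer(hocr):
        x0, y0, x1, y1 = (int(v) for v in m.group(1, 2, 3, 4))
        text = unescape(TAG_RE.sub("", m.group(5))).strip()
        words.append((text, (x0, y0, x1, y1)))
    return words


# Hand-written illustration of the hOCR structure (not real output).
SAMPLE = (
    "<span class='ocr_line' id='line_1_1' title='bbox 12 8 180 40'>"
    "<span class='ocrx_word' id='word_1_1' title='bbox 12 8 40 40; x_wconf 41'>&lt;</span> "
    "<span class='ocrx_word' id='word_1_2' title='bbox 52 8 130 40; x_wconf 93'>0,05</span> "
    "<span class='ocrx_word' id='word_1_3' title='bbox 142 8 180 40; x_wconf 90'>A</span>"
    "</span>"
)

for text, bbox in hocr_words(SAMPLE):
    print(text, bbox)
```

If no word box overlaps the frame that holds the tolerance value, the segmentation step did not isolate it, and training alone is unlikely to help.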



Lorenzo Blz wrote on Friday, 24 November 2023 at 10:45:14 UTC+1:

> Hi Simon,
> if I understand correctly how Tesseract works, it follows these steps:
>
> - it segments the image into lines of text
> - it then takes each individual line and slides a small window, 1px wide I 
> think, over it, from one end to the other. For each step the model outputs 
> a prediction. The model, being a bidirectional LSTM, has some memory of the 
> previous and following pixel columns.
> - all these predictions are converted into characters using beam search
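
To make the last step concrete, here is a toy sketch of how per-column predictions can be collapsed into characters. It assumes CTC-style output with a dedicated "blank" class at index 0, and it uses greedy decoding as a simpler stand-in for beam search; it is not Tesseract's actual decoder, and `greedy_ctc_decode` is a hypothetical name.

```python
from typing import List, Sequence

BLANK = 0  # assumed index of the "no character" class


def greedy_ctc_decode(frames: Sequence[Sequence[float]], charset: str) -> str:
    """Pick the best class per pixel column, merge repeats, drop blanks.

    frames[t][k] is the score of class k at column t; class k >= 1 maps to
    charset[k - 1].
    """
    out: List[str] = []
    prev = BLANK
    for scores in frames:
        best = max(range(len(scores)), key=scores.__getitem__)
        # Emit a character only when it is not blank and differs from the
        # previous column, so a glyph spanning several columns counts once.
        if best != BLANK and best != prev:
            out.append(charset[best - 1])
        prev = best
    return "".join(out)
```

Beam search keeps several candidate strings per column instead of only the best one, which lets it recover when the top-scoring class in a single column is wrong.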
>
> Please correct me if I got it wrong. So the first thing I think of, looking at 
> your picture, is the segmentation step. Do you want to read the "< 0,05 A" 
> block only? Is the segmentation step able to isolate it? This is the first 
> thing I would try to understand.
> Also your sample image for "<" has a very different angle from the one 
> before 0,05.
>
> In this case I would try a custom segmentation, looking for 
> rectangular boxes of a certain height, aspect ratio, etc., then cropping 
> these out (maybe dropping the rectangular box and the black vertical lines) 
> and feeding them to Tesseract. This of course requires custom programming.
>
> This might give good results even without fine tuning. I would try this 
> manually with GIMP first.
>
>
> Also I suppose you are not going to encounter a lot of wild fonts in 
> this kind of diagram. The more fonts you use, the harder the training. I 
> would focus on very few fonts, even one. I would start with exactly one 
> font and train on it to see quickly whether my training setup/pipeline is 
> working, and whether the training results carry over to the diagrams later. If 
> the model error rate is good on the individual text lines but bad on 
> the real images, it might be a segmentation problem that training cannot 
> fix. Or the problem might be the surrounding box, which I suppose you do not 
> have in your generated data.
>
> Ideally, I would use real crops from these diagrams rather than images 
> from text2image.
>
> Also, distinguishing 0 from O across many fonts is very hard. Often you have 
> domain knowledge that can help you fix these errors in post-processing; for 
> example, 0,O5 can be easily spotted and fixed. You can, for example, assume that 
> each box contains only one kind of data and guess the most likely one from 
> this or from the box sequence, etc.
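
A small sketch of that kind of post-processing fix, assuming tolerance values always have the shape "digits, separator, digits". The look-alike mapping and the `fix_numeric_token` helper are hypothetical and would need adjusting to the errors actually seen.

```python
import re

# Letters that OCR commonly confuses with digits; this mapping is an assumption.
LOOKALIKES = str.maketrans({"O": "0", "o": "0", "I": "1", "l": "1", "S": "5"})

# A token shaped like a decimal number once look-alikes are counted as digits.
NUMERIC_TOKEN = re.compile(r"^[0-9OoIlS]+[.,][0-9OoIlS]+$")


def fix_numeric_token(token: str) -> str:
    """Map look-alike letters to digits only when the token is shaped like X,XX."""
    # Require at least one real digit so plain words are never rewritten.
    if NUMERIC_TOKEN.match(token) and any(c.isdigit() for c in token):
        return token.translate(LOOKALIKES)
    return token


print(fix_numeric_token("0,O5"))
print(fix_numeric_token("A"))
```

Tokens that do not look numeric, such as the datum letter "A", are returned unchanged.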
>
> I got good results with 20k samples (real-world scanned docs, multiple 
> fonts). 10k seems reasonable. I also assume your output character set is 
> very small, like the digits, a few capital letters and a couple of 
> symbols (no %, ^, &, {, etc.).
>
>
>
> Lorenzo
>
> On Thu, 23 Nov 2023 at 10:16, Simon <smon...@gmail.com> wrote:
>
>> If I need to train new characters that are not recognized by a default 
>> model, is fine tuning the right approach in this case?
>> One of these characters is the one for angularity:  ∠
>>
>> These symbols appear in technical drawings and should be recognized in 
>> those. E.g. for the scenario in the following picture, Tesseract should 
>> recognize this symbol. 
>>
>>
>>
>> [image: angularity.png]
>>
>> Also, here is one of the images I tried to train with: 
>> [image: angularity_0_r0.jpg] 
>> They all look pretty similar to this one. The things that change are the 
>> angle, the proportions and the thickness of the lines. All examples have this 
>> 64x64 pixel box around them. 
>>
>>
>> Is fine tuning the right approach for this scenario? I only find 
>> information on fine tuning for specific fonts. Also, for fine tuning the 
>> "tesstrain" repository would not be needed, as it is used for training from 
>> scratch, correct?
>> desal...@gmail.com wrote on Wednesday, 22 November 2023 at 15:27:02 
>> UTC+1:
>>
>>> From my limited experience, you need a lot more data than that to train 
>>> from scratch. If you can't produce more data than that, you might first try to 
>>> fine tune, and then train by removing the top layer of the best model. 
>>>
>>> On Wednesday, November 22, 2023 at 4:46:53 PM UTC+3 smon...@gmail.com 
>>> wrote:
>>>
>>>> As it is not properly possible to combine my from-scratch traineddata 
>>>> with an existing one, I have decided to also train my traineddata model 
>>>> on numbers. Therefore I wrote a script which synthetically generates 
>>>> ground truth data with text2image. 
>>>> This script uses dozens of different fonts and creates numbers in the 
>>>> following formats. 
>>>> X.XXX
>>>> X.XX
>>>> X,XX
>>>> X,XXX
>>>> I generated 10,000 files to train the numbers. But unfortunately 
>>>> numbers get recognized pretty poorly with the best model (most of the time 
>>>> only "0.", "0" or "0," gets recognized).
>>>> So I wanted to ask whether this is too little training (ground truth) data for 
>>>> proper recognition when I train several fonts. 
>>>> Thanks in advance for your help. 
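
For reference, label generation for the four formats above could look roughly like the sketch below. It only produces the ground-truth strings; rendering each one with text2image is not shown. The function names and the fixed seed are hypothetical choices.

```python
import random
import re
from typing import List

# The four number formats listed above; "X" stands for one digit.
FORMATS = ["X.XXX", "X.XX", "X,XX", "X,XXX"]


def random_number(fmt: str, rng: random.Random) -> str:
    """Replace every X in the format with a uniformly random digit."""
    return "".join(str(rng.randint(0, 9)) if c == "X" else c for c in fmt)


def make_labels(n: int, seed: int = 0) -> List[str]:
    """Generate n ground-truth strings, choosing a format at random each time."""
    rng = random.Random(seed)  # fixed seed so the data set is reproducible
    return [random_number(rng.choice(FORMATS), rng) for _ in range(n)]


labels = make_labels(5)
print(labels)
```

Drawing every digit uniformly keeps the leading digit from being dominated by a single value, which is worth checking in any generated set.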
>>>>

-- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to tesseract-ocr+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/31d6a1f5-d114-485b-b6b3-897c57616783n%40googlegroups.com.
