Re: [tesseract-ocr] Re: Training from Scratch

2023-11-29 Thread Simon
Hey Lorenzo,

thanks a lot for your response. I've seen in the HOCR files of different 
technical drawings that Tesseract's text segmentation has massive 
problems recognizing zones with text, probably because of the various lines 
and complex constructions within the technical drawing. Even the zones 
where text does appear are only rarely recognized. So it seems pretty obvious 
to me that Tesseract is not built for documents without clear text lines.
Therefore I decided to follow your suggestion to crop out the boxes 
(Feature Control Frames) and feed them separately to Tesseract. To identify 
those boxes I would try to use OpenCV. I will also try to generate training 
data similar to these Feature Control Frames for training 
Tesseract. Do you think this approach could be successful?



Re: [tesseract-ocr] Re: Training from Scratch

2023-11-27 Thread Lorenzo Bolzani
Hi Simon, yes, I think the instructions you can give to the segmentation
step are quite limited, mostly the PSM parameter and I suppose a few minor
ones. There is something about tables, but I've never used it, and yours
might be too small for it to work. Yes, you should be able to see what is
happening by looking at the HOCR file.

You could also try the attached script; it was made for the 4.x version but
might work with 5.x too. It draws boxes around letters according to the
tesseract output. I'm attaching the output on a simple text and on several
crops from your image: only in the clean one can you see the text boxes.
You can do the same from the HOCR file.
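
The HOCR check can also be scripted; a minimal sketch with the Python standard library that lists the word boxes reported by Tesseract (the `ocrx_word` class and `bbox` layout follow the hOCR format; the sample markup in the test is made up):

```python
import re
import xml.etree.ElementTree as ET

def hocr_word_boxes(hocr_text):
    """Extract (word, (x1, y1, x2, y2)) pairs from an hOCR document."""
    root = ET.fromstring(hocr_text)
    boxes = []
    for elem in root.iter():
        if elem.get("class") == "ocrx_word":
            # the title attribute holds "bbox x1 y1 x2 y2; ..." per the hOCR spec
            m = re.search(r"bbox (\d+) (\d+) (\d+) (\d+)", elem.get("title", ""))
            if m:
                text = "".join(elem.itertext()).strip()
                boxes.append((text, tuple(int(v) for v in m.groups())))
    return boxes
```

If no word boxes come back for a region, the segmentation step never isolated any text there.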

Yes, you still need to fine-tune for the new character. I was able to train
up to 57k iterations while still improving the results on a test dataset. You
need to fine-tune including the new symbols AND all the other symbols you
expect to recognize in the training dataset.


I'm not sure if you are using something like this:

 merge_unicharsets $(TESSDATA)/$(CONTINUE_FROM).lstm-unicharset
$(TRAIN)/my.unicharset  "$@"

if so, you can replace it with:

 cp "$(TRAIN)/my.unicharset" "data/unicharset"

and the new model will output only the characters that are present in your
new dataset (useful, for example, to discard lowercase letters, the < character,
%, !, #, etc.)

Also, if you do not need to recognize the < symbol, you could reuse that
character rather than adding a completely new one. I mean that when you
generate the images with the "angle" symbol, you put < in the transcription.
Maybe it helps, maybe it won't.



Bye

Lorenzo





Re: [tesseract-ocr] Re: Training from Scratch

2023-11-25 Thread Simon
Yes, in general I want to recognize this part, "< 0,05 A", except that the 
< is actually ∠, the character for angularity.

The segmentation process of Tesseract can't be edited, right? So you mean I 
would need to make a Tesseract-independent program that localizes the 
boxes, crops them out, and feeds them to Tesseract? In that case I would 
still need to train Tesseract to recognize ∠. So I am still wondering 
how to train this sign properly.

Because you asked if the isolation step is able to isolate it: I can check 
this by looking at the hOCR information, right?




Re: [tesseract-ocr] Re: Training from Scratch

2023-11-24 Thread Des Bw
@zdenop: 
Yes, because the characters start to show up (get recognized) only after 
you run a few thousand iterations. For me, new characters started to get 
recognized only after I ran 5000 iterations. At that point, the base model 
has deteriorated terribly. It is common knowledge by now that 
fine-tuning for more than about 400 iterations heavily compromises the base 
model. For that reason, fine-tuning is not effective for adding new characters 
(even if the guide says it is possible). 

Dear Zdenop, I would love to know if there is a way around it. I have been 
languishing with Tesseract for months now because the default model is missing 
one important character.  

-- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to tesseract-ocr+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/e7ce8453-caf3-46ac-ae94-a795ad27fd4fn%40googlegroups.com.


Re: [tesseract-ocr] Re: Training from Scratch

2023-11-24 Thread Lorenzo Bolzani
Hi Simon,
if I understand correctly how tesseract works, it follows these steps:

- it segments the image into lines of text
- it then takes each individual line and slides a small window, 1px wide I
think, over it from one end to the other. For each step the model outputs
a prediction. The model, being a bidirectional LSTM, has some memory of the
previous and following pixel columns.
- all these predictions are converted into characters using beam search
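
To illustrate that last step, here is a toy best-path decoder; a greedy simplification of the CTC decoding idea (Tesseract's actual beam search is more involved, and the alphabet and probabilities here are invented):

```python
def ctc_greedy_decode(probs, alphabet, blank=0):
    """Collapse per-step predictions into text: pick the best symbol at
    each step, merge consecutive repeats, then drop the CTC blank."""
    best = [max(range(len(step)), key=step.__getitem__) for step in probs]
    out = []
    prev = None
    for idx in best:
        if idx != prev and idx != blank:
            out.append(alphabet[idx])
        prev = idx
    return "".join(out)
```

A beam search keeps several candidate paths at each step instead of only the single best one, which helps when the per-column probabilities are ambiguous.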

Please correct me if I got it wrong. So the first thing I look at in
your picture is the segmentation step. Do you want to read the "< 0,05 A"
block only? Is the segmentation step able to isolate it? This is the first
thing I would try to understand.
Also, your sample image for "<" has a very different angle from the one before
0,05.

In this case I would try to do a custom segmentation, looking for
rectangular boxes of a certain height, aspect ratio, etc., then crop
these out (maybe dropping the rectangular box and the black vertical lines)
and feed them to tesseract. This of course requires custom programming.

This might give good results even without fine tuning. I would try this
manually with GIMP first.


Also, I suppose you are not going to encounter a lot of wild fonts in
these kinds of diagrams. The more fonts you use, the harder the training. I
would focus on very few fonts, even one. I would start with exactly one
font and train on it to see quickly whether my training setup/pipeline is
working, and whether the training results carry over to the diagrams later. If
the model error rate is good on the individual text lines but bad on
the real images, it might be a segmentation problem that training cannot
fix. Or the problem might be the external box, which I suppose you do not
have in your generated data.

Ideally, I would use real crops from these diagrams rather than images from
text2image.

Also, distinguishing 0 from O with many fonts is very hard. Often you have
domain knowledge that can help you fix these errors in post-processing; for
example, 0,O5 can easily be spotted and fixed. You can, for example, assume
that each box contains only one kind of data and guess the most likely one
from this, or from the box sequence, etc.
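
That kind of post-correction could be sketched like this; the assumption that a valid tolerance field matches "digits, separator, digits" is mine, not from the thread:

```python
import re

def fix_numeric_field(text):
    """For a field expected to hold a tolerance value, map letters that
    are common OCR confusions back to digits; keep the original text if
    the repaired string still doesn't look like a number."""
    confusions = str.maketrans({"O": "0", "o": "0", "l": "1", "I": "1", "S": "5"})
    candidate = text.translate(confusions)
    return candidate if re.fullmatch(r"\d+[.,]\d+", candidate) else text
```

The same idea extends to whole-box validation: if a box is known to contain a datum reference like "A", reject numeric readings for it, and vice versa.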

I got good results with 20k samples (real-world scanned docs, multiple fonts).
10k seems reasonable. I also assume your output character set is very
small: the digits, a few capital letters, and a couple of symbols
(no %, ^, &, {, etc.).



Lorenzo

On Thu, 23 Nov 2023 at 10:16, Simon wrote:

> If I need to train new characters that are not recognized by a default
> model, is fine tuning the right approach in this case?
> One of these characters is the one for angularity:  ∠
>
> These symbols appear in technical drawings and should be recognized
> there. E.g. in the scenario in the following picture, tesseract should
> recognize this symbol.
>
>
>
> [image: angularity.png]
>
> Also, here is one of the pngs I tried to train with:
> [image: angularity_0_r0.jpg]
> They all look pretty similar to this one. Things that change are the
> angle, the proportions and the thickness of the lines. All examples have
> this 64x64 pixel box around them.
>
>
> Is fine tuning the right approach for this scenario? I only find
> information on fine tuning for specific fonts. Also, for fine tuning the
> "tesstrain" repository would not be needed, as it is used for training
> from scratch, correct?
> desal...@gmail.com wrote on Wednesday, 22 November 2023 at 15:27:02
> UTC+1:
>
>> From my limited experience, you need a lot more data than that to train
>> from scratch. If you can't make more than that data, you might first try
>> to fine-tune, and then train by removing the top layer of the best model.
>>
>> On Wednesday, November 22, 2023 at 4:46:53 PM UTC+3 smon...@gmail.com
>> wrote:
>>
>>> As it is not properly possible to combine my from-scratch traineddata
>>> with an existing one, I have decided to also train my traineddata model
>>> on numbers. Therefore I wrote a script which synthetically generates
>>> ground-truth data with text2image.
>>> This script uses dozens of different fonts and creates numbers in the
>>> following formats:
>>> X.XXX
>>> X.XX
>>> X,XX
>>> X,XXX
>>> I generated 10,000 files to train the numbers. But unfortunately, numbers
>>> are recognized pretty poorly with the best model (most of the time only
>>> "0.", "0" or "0," is recognized).
>>> So I wanted to ask whether it is simply not enough training data (ground
>>> truth) for proper recognition when I train several fonts.
>>> Thanks in advance for your help.
>>>

Re: [tesseract-ocr] Re: Training from Scratch

2023-11-23 Thread Zdenko Podobny
On Thu, 23 Nov 2023 at 10:28, Des Bw wrote:

> If the original model lacks the ∠ symbol, fine tuning is not going to add
> it for you.


Really???
The Tesseract documentation says:
"Fine tuning is the process of training an existing model on new data
without changing any part of the network, although you *can* now add
characters to the character set." (See "Fine Tuning for ± a few characters".)






[tesseract-ocr] Re: Training from Scratch

2023-11-23 Thread Des Bw
If you are planning to train, you need to make sure that your images 
contain all those variations in thickness, angle, etc. I don't know if 
text2image can do that for you. You might need to do it manually, or use 
some other tool. 
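
If text2image can't produce them, such variations can be generated directly; a numpy-only sketch of two of them, stroke thickening by dilation and small rotations of a boolean glyph mask (the parameter choices are arbitrary):

```python
import math
import numpy as np

def dilate(mask, times=1):
    """Grow foreground (True) pixels by one 4-neighbourhood step per pass,
    making strokes thicker."""
    out = mask.copy()
    for _ in range(times):
        grown = out.copy()
        grown[1:, :] |= out[:-1, :]
        grown[:-1, :] |= out[1:, :]
        grown[:, 1:] |= out[:, :-1]
        grown[:, :-1] |= out[:, 1:]
        out = grown
    return out

def rotate(mask, degrees):
    """Rotate a boolean glyph mask around its centre (nearest neighbour)."""
    h, w = mask.shape
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    t = math.radians(degrees)
    ys, xs = np.mgrid[0:h, 0:w]
    # inverse mapping: where does each output pixel come from?
    src_x = math.cos(t) * (xs - cx) + math.sin(t) * (ys - cy) + cx
    src_y = -math.sin(t) * (xs - cx) + math.cos(t) * (ys - cy) + cy
    sx = np.rint(src_x).astype(int)
    sy = np.rint(src_y).astype(int)
    valid = (sx >= 0) & (sx < w) & (sy >= 0) & (sy < h)
    out = np.zeros_like(mask)
    out[valid] = mask[sy[valid], sx[valid]]
    return out
```

Applying random small rotations (a few degrees) and one or two dilation passes to each rendered sample would cover the thickness/angle variation mentioned above.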




[tesseract-ocr] Re: Training from Scratch

2023-11-23 Thread Des Bw
Download the best model and try it. If it recognizes the symbol, that is 
great. You can also look at the unicharset of the best model. 



[tesseract-ocr] Re: Training from Scratch

2023-11-23 Thread Simon
Thanks a lot!
This is not possible with the tesstrain repository, right?

>>>

-- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to tesseract-ocr+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/23835b33-025a-48ad-9037-3eef237393cfn%40googlegroups.com.


[tesseract-ocr] Re: Training from Scratch

2023-11-23 Thread Des Bw
If the original model lacks the ∠ symbol, fine-tuning is not going to add 
it for you. We have all gone through that process. To introduce a new 
character, removing the top layer and training from there is the most 
effective approach.  
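[Editor's note] "Removing the top layer" means extracting the LSTM model from an existing traineddata file, cutting the network at its last layers, and retraining with a unicharset that includes the new symbol. A minimal sketch of the documented `combine_tessdata`/`lstmtraining` invocation follows; all paths are placeholders, and the append index and net spec (`[Lfx256 O1c111]` is the example from the Tesseract training docs, where 111 must be replaced by your actual unicharset size):

```shell
# Extract the LSTM component from an existing traineddata (paths are placeholders).
combine_tessdata -e eng.traineddata eng.lstm

# Cut the network at layer index 5 and append a fresh LSTM + output layer
# sized for the new unicharset, then train from there.
lstmtraining \
  --continue_from eng.lstm \
  --append_index 5 \
  --net_spec '[Lfx256 O1c111]' \
  --traineddata data/drawings/drawings.traineddata \
  --train_listfile data/drawings/list.train \
  --model_output data/drawings/checkpoints/drawings \
  --max_iterations 10000
```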


-- 
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/fb4a1b27-db44-49a6-adfa-ada9e13030aan%40googlegroups.com.


[tesseract-ocr] Re: Training from Scratch

2023-11-23 Thread Simon
If I need to train new characters that are not recognized by a default 
model, is fine-tuning the right approach in this case?
One of these characters is the one for angularity:  ∠

This symbol appears in technical drawings and should be recognized there. 
E.g., for the scenario in the following picture, Tesseract should 
recognize this symbol. 



[image: angularity.png]

Also, here is one of the PNGs I tried to train with: 
[image: angularity_0_r0.jpg] 
They all look pretty similar to this one. Things that change are the angle, 
the proportions and the thickness of the lines. All examples have this 64x64 
pixel box around them. 


Is fine-tuning the right approach for this scenario? I only find 
information on fine-tuning for specific fonts. For fine-tuning, the 
"tesstrain" repository would also not be needed, as it is used for training 
from scratch, correct?
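[Editor's note] Ground-truth lines containing the new symbol can be rendered with `text2image`, the same tool used for synthetic data elsewhere in this thread. A hedged sketch; the font name, `fonts_dir`, and the assumption that the chosen font actually contains the ∠ glyph are all placeholders to adapt:

```shell
# Ground-truth text containing the angularity symbol and a tolerance value.
printf '∠ 0.05\n' > angularity.gt.txt

# Render to a .tif/.box pair; skip gracefully if text2image is not installed.
if command -v text2image >/dev/null; then
  text2image \
    --text=angularity.gt.txt \
    --outputbase=angularity_0 \
    --font='DejaVu Sans' \
    --fonts_dir=/usr/share/fonts \
    --ptsize=12
else
  echo "text2image not installed; only the ground-truth file was written."
fi
```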

-- 
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/6a904604-f0b7-48ef-a4b2-cf1e97123041n%40googlegroups.com.


[tesseract-ocr] Re: Training from Scratch

2023-11-22 Thread Des Bw
From my limited experience, you need a lot more data than that to train 
from scratch. If you can't make more data than that, you might first try to 
fine-tune, and then train by removing the top layer of the best model. 

On Wednesday, November 22, 2023 at 4:46:53 PM UTC+3 smon...@gmail.com wrote:

> As it is not really possible to combine my from-scratch traineddata with 
> an existing one, I have decided to also train my traineddata model on 
> numbers. Therefore I wrote a script which synthetically generates 
> ground-truth data with text2image. 
> This script uses dozens of different fonts and creates numbers in the 
> following formats: 
> X.XXX
> X.XX
> X,XX
> X,XXX
> I generated 10,000 files to train the numbers. But unfortunately, numbers 
> get recognized pretty poorly with the best model (most of the time only 
> "0.", "0" or "0," gets recognized).  
> So I wanted to ask whether this is not enough training data (ground truth) 
> for proper recognition when I train several fonts. 
> Thanks in advance for your help. 
>
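[Editor's note] The four number formats quoted above (single digit, a `.` or `,` separator, then two or three digits) can be generated along these lines. A minimal sketch; the function names and digit ranges are assumptions, producing one ground-truth string per sample ready to feed to text2image:

```python
import random

# One format string per pattern quoted above: X.XXX, X.XX, X,XX, X,XXX.
FORMATS = ["{a}.{b:03d}", "{a}.{b:02d}", "{a},{b:02d}", "{a},{b:03d}"]

def random_number(fmt: str) -> str:
    """Render one ground-truth number in the given format."""
    upper = 999 if "03d" in fmt else 99
    return fmt.format(a=random.randint(0, 9), b=random.randint(0, upper))

def make_groundtruth(n: int, seed: int = 0) -> list[str]:
    """Generate n ground-truth lines, drawing formats at random."""
    random.seed(seed)
    return [random_number(random.choice(FORMATS)) for _ in range(n)]

if __name__ == "__main__":
    for line in make_groundtruth(10):
        print(line)
```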

-- 
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/7bb9eb1b-3e6e-47f7-bb13-03fc0fb5505dn%40googlegroups.com.