Yes, in general I want to recognize this part "< 0,05 A", except that the < is actually ∠, the character for angularity.

The segmentation process of Tesseract can't be edited, right? So you mean I would need to write a Tesseract-independent program that localizes the boxes, crops them out and feeds them to Tesseract? In that case I would still need to train Tesseract to recognize ∠, so I am still wondering how to train this sign properly. Since you asked whether the isolation step is able to isolate it: I can check this by looking at the hOCR information, right?
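This is roughly how I would check the segmentation from the hOCR output. It is only a sketch: it assumes the drawing is saved as drawing.png and that the tesseract binary is on the PATH; it just prints the bounding box of every detected line so I can see whether the "∠ 0,05 A" frame ends up as its own line.

    # Run Tesseract with the built-in "hocr" config and dump line boxes.
    # Depending on the Tesseract version the output file is out.hocr or out.html.
    import re
    import subprocess

    subprocess.run(["tesseract", "drawing.png", "out", "hocr"], check=True)

    with open("out.hocr", encoding="utf-8") as f:
        hocr = f.read()

    # Attribute quoting varies and some lines carry other classes
    # (ocr_header, ocr_caption), so the pattern is kept deliberately loose.
    for m in re.finditer(r"class=['\"]ocr_line[^>]*?bbox (\d+) (\d+) (\d+) (\d+)", hocr):
        print("line bbox:", m.groups())

Comparing these boxes against the pixel coordinates of the tolerance frame should show whether the line segmentation isolates it at all.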
Lorenzo Blz wrote on Friday, 24 November 2023 at 10:45:14 UTC+1:

> Hi Simon,
> if I understand correctly how tesseract works, it follows these steps:
>
> - it segments the image into lines of text
> - it then takes each individual line and slides a small window, 1 px wide I think, over it, from one end to the other. For each step the model outputs a prediction. The model, being a bidirectional LSTM, has some memory of the previous and following pixel columns.
> - all these predictions are converted into characters using beam search
>
> Please correct me if I got it wrong. So the first thing I think of, looking at your picture, is the segmentation step. Do you want to read the "< 0,05 A" block only? Is the segmentation step able to isolate it? This is the first thing I would try to understand.
> Also, your sample image for "<" has a very different angle to the one before 0,05.
>
> In this case I would try to do a custom segmentation, looking for rectangular boxes of a certain height, aspect ratio, etc., then cropping these out (maybe dropping the rectangular box and the black vertical lines) and feeding them to tesseract. This of course requires custom programming.
>
> This might give good results even without fine tuning. I would try this manually with GIMP first.
>
> Also, I suppose you are not going to encounter a lot of wild fonts in these kinds of diagrams. The more fonts you use, the harder the training. I would focus on very few fonts, even one. I would start with exactly one font and train on that to see quickly if my training setup/pipeline is working, and if the training results carry over to the diagrams later. If the model error rate is good on the individual text lines but bad on the real images, it might be a segmentation problem that training cannot fix. Or the problem might be the external box, which I suppose you do not have in your generated data.
>
> Ideally, I would use real crops from these diagrams rather than images from text2image.
>
> Also, distinguishing 0 from O with many fonts is very hard. Often you have domain knowledge that can help you fix these errors in post; for example 0,O5 can be easily spotted and fixed. You can, for example, assume that each box contains only one kind of data and guess the most likely one from this or from the box sequence, etc.
>
> I got good results with 20k samples (real-world scanned docs, multiple fonts). 10k seems reasonable. I also assume your output "character set" is very small, like the numbers, a few capital letters and a couple of symbols (no %, ^, &, {, etc.).
>
> Lorenzo
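For what it's worth, a minimal sketch of the custom segmentation Lorenzo describes: find box-like contours of a plausible size, crop them and OCR each crop as a single line. It assumes opencv-python and pytesseract are installed, the input is drawing.png, and every threshold below is a placeholder that would have to be tuned on real drawings.

    # Sketch: locate rectangular tolerance frames by contour shape,
    # crop them out and feed each crop to Tesseract separately.
    import cv2
    import pytesseract

    img = cv2.imread("drawing.png", cv2.IMREAD_GRAYSCALE)
    _, binary = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

    # RETR_LIST also returns boxes nested inside larger drawing outlines.
    contours, _ = cv2.findContours(binary, cv2.RETR_LIST, cv2.CHAIN_APPROX_SIMPLE)
    for c in contours:
        x, y, w, h = cv2.boundingRect(c)
        aspect = w / float(h)
        if 30 < h < 120 and 2.0 < aspect < 10.0:  # rough tolerance-frame shape
            crop = img[y:y + h, x:x + w]
            # --psm 7: treat the crop as a single text line
            text = pytesseract.image_to_string(crop, config="--psm 7")
            print((x, y, w, h), text.strip())

Dropping the vertical divider lines inside the frame before OCR, as Lorenzo suggests, would be an extra step applied to each crop.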
> On Thu, 23 Nov 2023 at 10:16, Simon <smon...@gmail.com> wrote:
>
>> If I need to train new characters that are not recognized by a default model, is fine tuning the right approach in this case?
>> One of these characters is the one for angularity: ∠
>>
>> These symbols appear in technical drawings and should be recognized there, e.g. for the scenario in the following picture Tesseract should recognize this symbol.
>>
>> [image: angularity.png]
>>
>> Also, here is one of the PNGs I tried to train with:
>> [image: angularity_0_r0.jpg]
>> They all look pretty similar to this one. Things that change are the angle, the proportions and the thickness of the lines. All examples have this 64x64 pixel box around it.
>>
>> Is fine tuning the right approach for this scenario? I only find information about fine tuning for specific fonts. For fine tuning, the "tesstrain" repository would also not be needed, as it is used for training from scratch, correct?
>>
>> desal...@gmail.com wrote on Wednesday, 22 November 2023 at 15:27:02 UTC+1:
>>
>>> From my limited experience, you need a lot more data than that to train from scratch. If you can't make more data than that, you might first try to fine tune and then train by removing the top layer of the best model.
>>>
>>> On Wednesday, November 22, 2023 at 4:46:53 PM UTC+3 smon...@gmail.com wrote:
>>>
>>>> As it is not properly possible to combine my from-scratch traineddata with an existing one, I have decided to also train my traineddata model on numbers. Therefore I wrote a script which synthetically generates ground-truth data with text2image.
>>>> This script uses dozens of different fonts and creates numbers in the following formats:
>>>> X.XXX
>>>> X.XX
>>>> X,XX
>>>> X,XXX
>>>> I generated 10,000 files to train the numbers. But unfortunately numbers get recognized pretty poorly with the best model (most of the time only "0.", "0" or "0," gets recognized).
>>>> So I wanted to ask whether this is simply not enough training (ground-truth) data for proper recognition when I train several fonts.
>>>> Thanks in advance for your help.
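For context, a simplified sketch of the kind of generation loop described in my quoted message above. The font names, the fonts_dir path and the sample count are placeholders, not my actual setup, and the exact text2image flags should be checked with text2image --help on your install.

    # Simplified sketch of synthetic ground-truth generation with text2image.
    # Each sample gets a .gt.txt transcription plus a rendered line image.
    import random
    import subprocess

    fonts = ["Arial", "DejaVu Sans"]  # example fonts only

    def sample_number():
        a = random.randint(0, 9)
        sep = random.choice([".", ","])
        if random.random() < 0.5:
            return f"{a}{sep}{random.randint(0, 99):02d}"    # X.XX / X,XX
        return f"{a}{sep}{random.randint(0, 999):03d}"        # X.XXX / X,XXX

    for i in range(10):  # raise to thousands for real training
        base = f"num_{i:05d}"
        with open(base + ".gt.txt", "w", encoding="utf-8") as f:
            f.write(sample_number() + "\n")
        subprocess.run([
            "text2image",
            f"--text={base}.gt.txt",
            f"--outputbase={base}",
            f"--font={random.choice(fonts)}",
            "--fonts_dir=/usr/share/fonts",
        ], check=True)

If ∠ is to be trained the same way, the training text would also need lines containing U+2220, and the chosen fonts must actually contain that glyph.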