@zdenop: Yes, because new characters only start to show up (get recognized) after you run a few thousand iterations. For me, new characters start to get recognized only after about 5000 iterations. By that point, the base model has deteriorated terribly. It is now common knowledge that fine-tuning for more than 400 iterations severely compromises the base model. For that reason, fine-tuning is not an effective way to add new characters (even though the guide says it is possible).
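For reference, here is a minimal sketch of how a fine-tuning run with a capped iteration count could be assembled. It only builds the argument list and does not run anything. The flag names are the ones used in the "Fine Tuning for ± a few characters" section of the Tesseract training documentation; the helper name `build_finetune_command` and every path are hypothetical, and the default of 400 is simply the figure mentioned above, not a documented limit.

```python
from pathlib import Path


def build_finetune_command(
    base_lstm: Path,
    new_traineddata: Path,
    old_traineddata: Path,
    train_listfile: Path,
    model_output: Path,
    max_iterations: int = 400,  # assumption: the cap discussed in this thread
) -> list[str]:
    """Return an lstmtraining argument list for fine-tuning with an extended character set."""
    if max_iterations <= 0:
        raise ValueError("max_iterations must be positive")
    return [
        "lstmtraining",
        "--model_output", str(model_output),
        "--continue_from", str(base_lstm),            # LSTM extracted from the base model
        "--traineddata", str(new_traineddata),        # starter traineddata with the new unicharset
        "--old_traineddata", str(old_traineddata),    # original traineddata, used to map old characters
        "--train_listfile", str(train_listfile),
        "--max_iterations", str(max_iterations),
    ]


if __name__ == "__main__":
    # All paths below are placeholders for illustration only.
    cmd = build_finetune_command(
        base_lstm=Path("work/eng.lstm"),
        new_traineddata=Path("work/eng/eng.traineddata"),
        old_traineddata=Path("tessdata_best/eng.traineddata"),
        train_listfile=Path("work/eng.training_files.txt"),
        model_output=Path("work/output/angle"),
    )
    print(" ".join(cmd))
```

The list can be passed to `subprocess.run` once the paths point at real files. Raising `max_iterations` is what trades recognition of the new character against damage to the base model, as described above.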
Dear Zdenop, I would love to know if there is a way around it. I have been struggling with Tesseract for months now because the default model is missing one important character.

On Thursday, November 23, 2023 at 8:59:01 PM UTC+3 zdenop wrote:
>
> On Thu, 23 Nov 2023 at 10:28, Des Bw <desal...@gmail.com> wrote:
>
>> If the original model lacks the ∠ symbol, fine tuning is not going to
>> add it for you.
>
> Really???
> Tesseract documentation
> <https://github.com/tesseract-ocr/tessdoc/blob/2f4d1e47094acbe3e046144573c928d740595f55/tess4/TrainingTesseract-4.00.md#fine-tuning-for-impact>:
>
> Fine tuning is the process of training an existing model on new data
> without changing any part of the network, although you *can* now add
> characters to the character set. (See Fine Tuning for ± a few characters
> <https://github.com/tesseract-ocr/tessdoc/blob/2f4d1e47094acbe3e046144573c928d740595f55/tess4/TrainingTesseract-4.00.md#fine-tuning-for--a-few-characters>
> ).
>
>> We have all gone through that process. To introduce a new character,
>> removing the top layer and training from there is the most
>> effective approach.
>>
>> On Thursday, November 23, 2023 at 12:15:56 PM UTC+3 smon...@gmail.com
>> wrote:
>>
>>> If I need to train new characters that are not recognized by a default
>>> model, is fine tuning the right approach in this case?
>>> One of these characters is the one for angularity: ∠
>>>
>>> This symbol appears in technical drawings and should be recognized in
>>> those. E.g. for the scenario in the following picture, tesseract should
>>> recognize this symbol.
>>>
>>> [image: angularity.png]
>>>
>>> Also, here is one of the images I tried to train with:
>>> [image: angularity_0_r0.jpg]
>>> They all look pretty similar to this one. The things that change are the
>>> angle, the proportions and the thickness of the lines. All examples have this
>>> 64x64 pixel box around them.
>>>
>>> Is fine tuning the right approach for this scenario? I can only find
>>> information on fine tuning for specific fonts. For fine tuning, the
>>> "tesstrain" repository would also not be needed, as it is used for
>>> training from scratch, correct?
>>> desal...@gmail.com wrote on Wednesday, 22 November 2023 at 15:27:02
>>> UTC+1:
>>>
>>>> From my limited experience, you need a lot more data than that to train
>>>> from scratch. If you can't produce more data than that, you might first
>>>> try to fine-tune, and then train by removing the top layer of the best
>>>> model.
>>>>
>>>> On Wednesday, November 22, 2023 at 4:46:53 PM UTC+3 smon...@gmail.com
>>>> wrote:
>>>>
>>>>> As it is not properly possible to combine my traineddata from scratch
>>>>> with an existing one, I have decided to also train my traineddata model
>>>>> on numbers. Therefore I wrote a script which synthetically generates
>>>>> ground truth data with text2image.
>>>>> This script uses dozens of different fonts and creates numbers in the
>>>>> following formats:
>>>>> X.XXX
>>>>> X.XX
>>>>> X,XX
>>>>> X,XXX
>>>>> I generated 10,000 files to train the numbers. But unfortunately
>>>>> numbers get recognized pretty poorly with the best model (most of the
>>>>> time only "0.", "0" or "0," gets recognized).
>>>>> So I wanted to ask whether this is not enough training (ground truth)
>>>>> data for proper recognition when I train several fonts.
>>>>> Thanks in advance for your help.
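The number formats listed in the earliest quoted message (X.XXX, X.XX, X,XX, X,XXX) can be sketched as a small ground-truth text generator. This is only an illustration under stated assumptions, not the script from that message: the function names are hypothetical, digits are drawn uniformly, and rendering the strings to images with text2image is left out.

```python
import random

# The four formats named in the quoted message; 'X' stands for one digit.
FORMATS = ("X.XXX", "X.XX", "X,XX", "X,XXX")


def random_number_text(fmt: str, rng: random.Random) -> str:
    """Replace every 'X' in fmt with a random digit, keeping the separator."""
    return "".join(str(rng.randint(0, 9)) if ch == "X" else ch for ch in fmt)


def generate_ground_truth(count: int, seed: int = 0) -> list[str]:
    """Return `count` number strings, each in a randomly chosen format."""
    rng = random.Random(seed)  # seeded so a run can be reproduced
    return [random_number_text(rng.choice(FORMATS), rng) for _ in range(count)]


if __name__ == "__main__":
    # Print a few samples; each line would become one ground-truth text file.
    for line in generate_ground_truth(5):
        print(line)
```

Checking the distribution of generated strings (for example, how often each leading digit occurs) is a cheap way to rule out a skewed training set before looking at the training itself.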
--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-ocr+unsubscr...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/e7ce8453-caf3-46ac-ae94-a795ad27fd4fn%40googlegroups.com.