Re: [tesseract-ocr] Re: Training from Scratch

2023-11-29 Thread Simon
Hey Lorenzo, thanks a lot for your response. I've seen in the HOCR files of different technical drawings that Tesseract's text segmentation has massive problems recognizing zones with text, probably because of the various lines and complex constructions within the technical drawing. Even the

Re: [tesseract-ocr] Re: Training from Scratch

2023-11-27 Thread Lorenzo Bolzani
Hi Simon, yes, I think the instructions you can give to the segmentation step are quite limited, mostly the PSM parameter and I suppose a few minor ones. There is something about tables but I've never used it and yours might be too small for this to work. Yes, you should be able to see what is
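
A minimal sketch of how the PSM parameter mentioned above might be passed on the command line, assuming the standard `tesseract` CLI with its `--psm` option and the `hocr` config. The helper name and the file names are hypothetical; PSM 11 (sparse text) is only a guess at a mode worth trying on drawings, not a recommendation from this thread.

```python
def tesseract_hocr_command(image_path, output_base, psm=11):
    """Build the argument list for a tesseract run that writes hOCR output.

    psm is the page segmentation mode; the CLI accepts values 0 through 13.
    """
    if not 0 <= psm <= 13:
        raise ValueError("psm must be between 0 and 13")
    # Layout: tesseract <image> <outputbase> [options...] [configfile...]
    return ["tesseract", image_path, output_base, "--psm", str(psm), "hocr"]


# Hypothetical file names, for illustration only.
command = tesseract_hocr_command("drawing.png", "drawing_out", psm=11)
print(" ".join(command))
```

The list can be handed to `subprocess.run(command, check=True)` on a machine where tesseract is installed; comparing the hOCR from a few PSM values shows which zones the segmentation step actually finds.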

Re: [tesseract-ocr] Re: Training from Scratch

2023-11-25 Thread Simon
Yes, in general I want to recognize this part "< 0,05 A", except that the < is actually ∠, the character for angularity. The segmentation process of Tesseract can't be edited, right? So you mean I would need to make a Tesseract-independent program that localizes the boxes, crops them out and

Re: [tesseract-ocr] Re: Training from Scratch

2023-11-24 Thread Des Bw
@zdenop: Yes, because the characters start to show up (get recognized) only after you run a few thousand iterations. For me, new characters start to get recognized only after I run 5000 iterations. At that point, the base model will have deteriorated terribly. It is now common knowledge

Re: [tesseract-ocr] Re: Training from Scratch

2023-11-24 Thread Lorenzo Bolzani
Hi Simon, if I understand correctly how tesseract works, it follows these steps: - it segments the image into lines of text - it then takes each individual line and slides a small window, 1px wide I think, over it, from one end to the other. For each step the model outputs a prediction. The model,

Re: [tesseract-ocr] Re: Training from Scratch

2023-11-23 Thread Zdenko Podobny
On Thu, 23 Nov 2023 at 10:28, Des Bw wrote: > If the original model lacks the ∠ symbol, fine tuning is not going to add > it for you. Really??? Tesseract documentation

[tesseract-ocr] Re: Training from Scratch

2023-11-23 Thread Des Bw
If you are planning to train, you need to make sure that your images contain all those variations: in thickness, angle etc. I don't know if text2image can do that for you. You might need to do it manually; or use some other tool. On Thursday, November 23, 2023 at 12:39:21 PM UTC+3 Des Bw

[tesseract-ocr] Re: Training from Scratch

2023-11-23 Thread Des Bw
Download the best model and try it. If it recognizes it, that is great. You can also look at the unicharset of the best model.
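
A rough sketch of the unicharset check suggested above. It assumes the unicharset has already been extracted to a text file (for example with `combine_tessdata -u`), that the first line holds the entry count, and that every following line starts with the symbol and then its space-separated properties. The sample text below is made up and heavily abbreviated; it is not the content of any real model.

```python
def unicharset_symbols(text):
    """Return the set of symbols listed in the text of a unicharset file."""
    lines = text.splitlines()
    count = int(lines[0].strip())  # first line: number of entries
    symbols = set()
    for line in lines[1:1 + count]:
        if not line.strip():
            continue
        # The symbol is everything before the first space on the line.
        symbols.add(line.split(" ", 1)[0])
    return symbols


# Made-up, abbreviated sample for illustration only.
sample = """4
NULL 0 Common 0
A 5 Latin 1
0 8 Common 2
< 0 Common 3
"""

symbols = unicharset_symbols(sample)
print("∠" in symbols)  # False for this sample: the angularity symbol is missing
```

With a real file, read its text with `open(path, encoding="utf-8").read()` and test for `"∠"` in the same way.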

[tesseract-ocr] Re: Training from Scratch

2023-11-23 Thread Simon
Thanks a lot! This is not possible with the tesstrain repository, right? desal...@gmail.com wrote on Thursday, 23 November 2023 at 10:28:26 UTC+1: > If the original model lacks the ∠ symbol, fine tuning is not going to add > it for you. We have all gone through that process. To introduce a

[tesseract-ocr] Re: Training from Scratch

2023-11-23 Thread Des Bw
If the original model lacks the ∠ symbol, fine tuning is not going to add it for you. We have all gone through that process. To introduce a new character, removing the top layer and training from there is the most effective approach. On Thursday, November 23, 2023 at 12:15:56 PM UTC+3

[tesseract-ocr] Re: Training from Scratch

2023-11-23 Thread Simon
If I need to train new characters that are not recognized by a default model, is fine tuning the right approach in this case? One of these characters is the one for angularity: ∠ These symbols appear in technical drawings and should be recognised in those. E.g. for the scenario in the

[tesseract-ocr] Re: Training from Scratch

2023-11-22 Thread Des Bw
From my limited experience, you need a lot more data than that to train from scratch. If you can't make more data than that, you might first try to fine-tune, and then train by removing the top layer of the best model. On Wednesday, November 22, 2023 at 4:46:53 PM UTC+3 smon...@gmail.com