Yes, in general I want to recognize this part, "< 0,05 A", except that the <
is actually ∠, the character for angularity.
The segmentation process of Tesseract can't be edited, right? So you mean I
would need to write a Tesseract-independent program that localizes the
boxes, crops them out and feeds them to Tesseract? In that case I would
still need to train Tesseract to recognize ∠. So I am still wondering
how to train this sign properly.
Since you asked whether the segmentation step is able to isolate it: I can
check this by looking at the hOCR information, right?
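
One way to do that check, as a rough sketch: produce hOCR with the command line (`tesseract image.png out hocr` writes `out.hocr`) and list the bounding boxes Tesseract assigned to lines and words. The helper name `hocr_boxes` and the sample markup below are made up for illustration; the regular expression assumes the attribute order Tesseract normally emits (`class`, then `id`, then `title`) and is not a full hOCR parser.

```python
import re

# Captures the hOCR class and the four bbox numbers of an element.
# Assumption: class=... comes before title=... inside the tag.
HOCR_ELEMENT = re.compile(
    r"<(?:span|div|p)\s+class=['\"](ocr_line|ocrx_word|ocr_carea)['\"][^>]*?"
    r"title=['\"][^'\"]*?bbox (\d+) (\d+) (\d+) (\d+)"
)


def hocr_boxes(hocr_text, kind="ocr_line"):
    """Return (x0, y0, x1, y1) for every element of the given hOCR class."""
    boxes = []
    for match in HOCR_ELEMENT.finditer(hocr_text):
        if match.group(1) == kind:
            boxes.append(tuple(int(v) for v in match.groups()[1:]))
    return boxes


# Hand-written sample in the shape of Tesseract's hOCR output (not real output).
sample = """
<div class='ocr_carea' id='block_1_1' title="bbox 10 10 200 60">
<span class='ocr_line' id='line_1_1' title="bbox 12 14 198 40; baseline 0 -3">
<span class='ocrx_word' id='word_1_1' title='bbox 12 14 40 40; x_wconf 31'>&lt;</span>
<span class='ocrx_word' id='word_1_2' title='bbox 50 14 120 40; x_wconf 93'>0,05</span>
</span></div>
"""

print(hocr_boxes(sample, "ocr_line"))
print(hocr_boxes(sample, "ocrx_word"))
```

If the line box covers only the "∠ 0,05 A" frame, segmentation isolated it; if it also swallows neighbouring frames or drawing lines, the problem is likely upstream of recognition.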
Lorenzo Blz wrote on Friday, 24 November 2023 at 10:45:14 UTC+1:
> Hi Simon,
> if I understand correctly how tesseract works, it follows these steps:
>
> - it segments the image into lines of text
> - it then takes each individual line and slides a small window, 1px wide I
> think, over it, from one end to the other. At each step the model outputs
> a prediction. The model, being a bidirectional LSTM, has some memory of the
> previous and following pixel columns.
> - all these predictions are converted into characters using beam search
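
To make the last step concrete, here is a much-simplified sketch. It assumes the per-column outputs are CTC-style labels with a "blank" class, and it uses greedy collapsing (merge repeats, drop blanks) as a simpler stand-in for beam search. The `BLANK` symbol and the label sequence are invented for illustration; this is not Tesseract's actual decoder.

```python
BLANK = "-"  # hypothetical placeholder for the "no character" class


def greedy_collapse(column_labels):
    """Merge runs of identical labels, then drop blanks (CTC-style greedy decoding)."""
    decoded = []
    previous = None
    for label in column_labels:
        # Emit a character only when it starts a new run and is not a blank.
        if label != previous and label != BLANK:
            decoded.append(label)
        previous = label
    return "".join(decoded)


# One label per pixel column, as a model might output them for "0,05".
print(greedy_collapse(list("--00-,,--00--55-")))
```

Beam search differs in that it keeps several candidate strings alive per column instead of committing to the single best label each time.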
>
> Please correct me if I got it wrong. So the first thing I think of, looking
> at your picture, is the segmentation step. Do you want to read the
> "< 0,05 A" block only? Is the segmentation step able to isolate it? This is
> the first thing I would try to understand.
> Also your sample image for "<" has a very different angle to the one
> before 0,05.
>
> In this case I would try to do a custom segmentation, looking for
> rectangular boxes of a certain height, aspect ratio, etc., then cropping
> these out (maybe dropping the rectangular box and the black vertical lines)
> and feeding them to tesseract. This of course requires custom programming.
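
A minimal sketch of that filtering idea, assuming candidate rectangles have already been found by some other means (for example a contour detector, not shown here). The function names and all thresholds are hypothetical and would need tuning on real drawings.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x, y, width, height) in pixels


def select_frames(
    boxes: List[Box],
    min_height: int = 20,
    max_height: int = 80,
    min_aspect: float = 2.0,
    max_aspect: float = 12.0,
) -> List[Box]:
    """Keep rectangles whose height and width/height ratio fit a tolerance frame."""
    keep = []
    for x, y, w, h in boxes:
        if h <= 0:
            continue  # skip degenerate rectangles
        aspect = w / h
        if min_height <= h <= max_height and min_aspect <= aspect <= max_aspect:
            keep.append((x, y, w, h))
    return keep


def inner_crop(box: Box, border: int = 3) -> Tuple[int, int, int, int]:
    """Shrink a box by `border` pixels per side to drop the frame lines.

    Returns (left, top, right, bottom), the order most image libraries
    expect for cropping.
    """
    x, y, w, h = box
    return (x + border, y + border, x + w - border, y + h - border)


candidates = [(10, 10, 200, 40), (5, 5, 30, 30), (0, 0, 1000, 600)]
for frame in select_frames(candidates):
    print(frame, "->", inner_crop(frame))
```

Each resulting crop can then be passed to Tesseract as a single text line, which sidesteps its own page segmentation.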
>
> This might give good results even without fine tuning. I would try this
> manually with GIMP first.
>
>
> Also I suppose you are not going to encounter a lot of wild fonts in
> this kind of diagram. The more fonts you use, the harder the training. I
> would focus on very few fonts, even one. I would start with exactly one
> font and train on it to see quickly whether my training setup/pipeline is
> working, and whether the training results carry over to the diagrams later.
> If the model error rate is good on the individual text lines and bad on
> the real images, it might be a segmentation problem that training cannot
> fix. Or the problem might be the external box, which I suppose you do not
> have in your generated data.
>
> Ideally, I would use real crops from these diagrams rather than images
> from text2image.
>
> Also distinguishing 0 from O with many fonts is very hard. Often you have
> domain knowledge that can help you fix these errors in post-processing; for
> example, 0,O5 can be easily spotted and fixed. You can, for example, assume
> that each box contains only one kind of data and guess the most likely one
> from this or from the box sequence, etc.
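
A small sketch of such a post-processing fix, assuming a box is expected to hold a decimal number. The look-alike table and the function name are illustrative assumptions; the substitution is only kept when the result actually looks like a number, so ordinary words are left alone.

```python
import re

# Letters commonly confused with digits (assumed mapping, extend as needed).
LOOKALIKES = str.maketrans({"O": "0", "o": "0", "l": "1", "I": "1"})

# After substitution the token should look like X,XX or X.XXX and similar.
NUMBER = re.compile(r"^\d+[.,]\d+$")


def fix_numeric_token(token):
    """Replace digit look-alikes, but only if that yields a well-formed number."""
    candidate = token.translate(LOOKALIKES)
    return candidate if NUMBER.match(candidate) else token


for raw in ["0,O5", "O.125", "OIL", "A"]:
    print(raw, "->", fix_numeric_token(raw))
```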
>
> I got good results with 20k samples (real-world scanned docs, multiple
> fonts). 10k seems reasonable; I also assume your output character set is
> very small, like the numbers, a few capital letters and a couple of
> symbols (no %, ^, &, {, etc.).
>
>
>
> Lorenzo
>
> On Thu, 23 Nov 2023 at 10:16, Simon <[email protected]> wrote:
>
>> If I need to train new characters that are not recognized by a default
>> model, is fine-tuning the right approach in this case?
>> One of these characters is the one for angularity: ∠
>>
>> These symbols appear in technical drawings and should be recognized in
>> those. E.g. in the scenario in the following picture, tesseract should
>> recognize this symbol.
>>
>>
>>
>> [image: angularity.png]
>>
>> Also here is one of the pngs I tried to train with:
>> [image: angularity_0_r0.jpg]
>> They all look pretty similar to this one. The things that change are the
>> angle, the proportions and the thickness of the lines. All examples have
>> this 64x64 pixel box around them.
>>
>>
>> Is fine-tuning the right approach for this scenario? I only find
>> information on fine-tuning for specific fonts. Also, for fine-tuning the
>> "tesstrain" repository would not be needed, as it is used for training
>> from scratch, correct?
>> [email protected] wrote on Wednesday, 22 November 2023 at 15:27:02
>> UTC+1:
>>
>>> From my limited experience, you need a lot more data than that to train
>>> from scratch. If you can't make more data than that, you might first try
>>> to fine-tune, and then train by removing the top layer of the best model.
>>>
>>> On Wednesday, November 22, 2023 at 4:46:53 PM UTC+3 [email protected]
>>> wrote:
>>>
>>>> As it is not really possible to combine my from-scratch traineddata
>>>> with an existing one, I have decided to also train my traineddata model
>>>> on numbers. Therefore I wrote a script which synthetically generates
>>>> ground truth data with text2image.
>>>> This script uses dozens of different fonts and creates numbers in the
>>>> following formats:
>>>> X.XXX
>>>> X.XX
>>>> X,XX
>>>> X,XXX
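
For reference, the text side of such a generator could look roughly like the sketch below. It only produces the ground-truth strings for the four formats listed; rendering them with text2image is not shown, and the names `FORMATS` and `random_number_text` are hypothetical, not taken from the original script.

```python
import random

# The four number formats from the message; each X becomes a random digit.
FORMATS = ["X.XXX", "X.XX", "X,XX", "X,XXX"]


def random_number_text(rng):
    """Pick one format and replace every X with a uniformly random digit."""
    template = rng.choice(FORMATS)
    return "".join(str(rng.randint(0, 9)) if ch == "X" else ch for ch in template)


rng = random.Random(0)  # fixed seed so the generated set is reproducible
samples = [random_number_text(rng) for _ in range(5)]
print(samples)
```

Drawing every digit position uniformly keeps all ten digits about equally represented in the ground truth; whether the original script did this is not stated in the thread.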
>>>> I generated 10,000 files to train the numbers. But unfortunately
>>>> numbers get recognized pretty poorly with the best model (most of the
>>>> time only "0.", "0" or "0," gets recognized).
>>>> So I wanted to ask whether this is too little training (ground truth)
>>>> data for proper recognition when I train several fonts.
>>>> Thanks in advance for your help.
>>>>
>>> --
>>
> You received this message because you are subscribed to the Google Groups
>> "tesseract-ocr" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to [email protected].
>>
> To view this discussion on the web visit
>> https://groups.google.com/d/msgid/tesseract-ocr/6a904604-f0b7-48ef-a4b2-cf1e97123041n%40googlegroups.com
>>
>> <https://groups.google.com/d/msgid/tesseract-ocr/6a904604-f0b7-48ef-a4b2-cf1e97123041n%40googlegroups.com?utm_medium=email&utm_source=footer>
>> .
>>
>