But if your only option is to edit the boxes manually, I can't help with that. I have never tried that route.
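For the synthetic-data route mentioned in the quoted thread, here is a rough sketch of how the per-font text2image runs could be set up. It only builds the command lines and does not run them. The font names, the training text file `commas.txt`, and the directory paths are placeholders I made up for illustration, so substitute your own. It also prints the Unicode names of the three commas, so you can check that your training text really contains the code points you expect.

```python
import shlex
import unicodedata

# The three comma code points discussed in this thread.
COMMAS = ["\uFF0C", "\uFF64", "\u002C"]

# Placeholder font names (assumption): use the fonts installed on your system.
FONTS = ["Noto Sans CJK SC", "Noto Serif CJK SC"]


def comma_names():
    """Map each code point (as 'U+XXXX') to its official Unicode name."""
    return {f"U+{ord(c):04X}": unicodedata.name(c) for c in COMMAS}


def text2image_command(text_file, font, fonts_dir, out_dir):
    """Build (but do not run) a text2image command line for one font."""
    outputbase = f"{out_dir}/{font.replace(' ', '_')}"
    return [
        "text2image",
        f"--text={text_file}",        # training text containing the commas
        f"--outputbase={outputbase}",  # prefix for the generated image/box files
        f"--font={font}",
        f"--fonts_dir={fonts_dir}",
    ]


if __name__ == "__main__":
    for code, name in comma_names().items():
        print(code, name)
    for font in FONTS:
        # Paths below are hypothetical examples.
        cmd = text2image_command("commas.txt", font, "/usr/share/fonts", "train")
        print(shlex.join(cmd))
```

Each printed command can be run from the shell, or passed to `subprocess.run(cmd, check=True)`, once text2image from the Tesseract training tools is installed. The boxes then come from the renderer, so nothing has to be drawn by hand.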
On Wednesday, October 18, 2023 at 3:43:51 PM UTC+3 Des Bw wrote:

> You need a large amount of data. That is all.
> If you can collect a lot of text lines that contain all those types of commas and produce the training material using text2image (synthetic data) for each font, I am pretty sure Tesseract will learn all of them with no problem.
>
> On Wednesday, October 18, 2023 at 12:35:01 PM UTC+3 Danny wrote:
>
>> There are a few "commas" used in CJK, which makes it complicated for me.
>>
>> *FULLWIDTH COMMA U+FF0C* (link <https://www.compart.com/en/unicode/U+FF0C>), which might have the glyph in the center of the box or in the lower left corner depending on the font:
>>
>> [image: Screenshot 2023-10-18 at 17.19.27.png] [image: commaFullWidth.jpg]
>>
>> *HALFWIDTH IDEOGRAPHIC COMMA U+FF64* (link <https://www.compart.com/en/unicode/U+FF64>), which (as far as I can tell) will always be in the bottom corner regardless of font. (It is used to enumerate sequences.)
>> [image: Screenshot 2023-10-18 at 17.23.33.png]
>>
>> *COMMA U+002C* (link <https://www.compart.com/en/unicode/U+002C>), which isn't part of formal CJK languages but in practice is used all the time.
>> [image: Screenshot 2023-10-18 at 17.21.50.png]
>>
>> So I'd like to train Tesseract to recognize the three types of commas so that the OCR output matches the input images. "FULLWIDTH COMMA" is a problem because the glyph position in the box differs depending on the font. Hence my question "where and how big is the box?"
>>
>> [image: Screenshot 2023-10-18 at 17.28.40.png]
>>
>> In the image above, lines 1, 2, and 3 are all FULLWIDTH COMMA, but line 1 is a different font. Line 4 is COMMA (U+002C), while line 5 is HALFWIDTH IDEOGRAPHIC COMMA U+FF64.
>>
>> What's the best way to train given those types of input and the expected output?
>>
>> Danny
>>
>> On Wednesday, October 18, 2023 at 1:22:25 PM UTC+8 [email protected] wrote:
>>
>>> If the space is included in the training across the board, the model might not recognize the comma when it appears without space (as in numbers: 23,334).
>>>
>>> On Wednesday, October 18, 2023 at 5:29:13 AM UTC+3 Danny wrote:
>>>
>>>> For purposes of training, I'm wondering if the box for a character should include the surrounding space.
>>>>
>>>> In particular for the CJK "FULLWIDTH COMMA", should the box be the red or green rectangle?

