There are several "commas" in use in CJK text, which complicates things for me.

*FULLWIDTH COMMA U+FF0C* (link <https://www.compart.com/en/unicode/U+FF0C>) 
which might have the glyph in the center of the box or in the lower left 
corner depending on the font:

[image: Screenshot 2023-10-18 at 17.19.27.png] [image: commaFullWidth.jpg]

*HALFWIDTH IDEOGRAPHIC COMMA U+FF64* (link 
<https://www.compart.com/en/unicode/U+FF64>) which (as far as I can tell) 
always sits in the bottom corner regardless of font. It is used to separate 
the items of an enumeration.
[image: Screenshot 2023-10-18 at 17.23.33.png]

*COMMA U+002C* (link <https://www.compart.com/en/unicode/U+002C>) which 
isn't part of formal CJK writing but in practice is used all the time:
[image: Screenshot 2023-10-18 at 17.21.50.png]
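
For reference, here is a minimal Python sketch (standard library only) that prints the code point, Unicode name, and East Asian width class of the three commas above. The helper name `describe` is made up for illustration; it is only meant as a quick way to confirm which character a ground-truth file actually contains.

```python
import unicodedata

# The three comma characters discussed above.
COMMAS = ["\uFF0C", "\uFF64", "\u002C"]


def describe(ch: str) -> str:
    """Return 'U+XXXX NAME (East Asian width class)' for one character."""
    code_point = f"U+{ord(ch):04X}"
    name = unicodedata.name(ch)
    width = unicodedata.east_asian_width(ch)  # e.g. F = fullwidth, H = halfwidth
    return f"{code_point} {name} ({width})"


for ch in COMMAS:
    print(describe(ch))
```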

So I'd like to train the model to recognize the three types of commas so that 
the OCR output matches the input images. "FULLWIDTH COMMA" is a problem because 
the glyph position in the box differs depending on the font. Hence my 
question "where and how big is the box?"

[image: Screenshot 2023-10-18 at 17.28.40.png]

In the image above, lines 1, 2, and 3 are all FULLWIDTH COMMA (U+FF0C), but 
line 1 uses a different font. Line 4 is COMMA (U+002C), while line 5 is 
HALFWIDTH IDEOGRAPHIC COMMA (U+FF64).
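
To check whether a trained model keeps the three commas apart, something like the following sketch could compare the expected text with the OCR output. It is only a rough check: the function name `comma_confusions` is hypothetical, and it pairs commas by order of appearance, so the pairing goes wrong if the OCR drops or inserts a comma.

```python
from collections import Counter

# The three comma characters to keep apart.
COMMAS = {"\uFF0C", "\uFF64", "\u002C"}


def comma_confusions(expected: str, actual: str) -> Counter:
    """Count (expected comma, recognized comma) pairs, matched by order.

    Assumes both strings contain the same number of commas; extra commas
    on either side are ignored by zip().
    """
    expected_commas = [c for c in expected if c in COMMAS]
    actual_commas = [c for c in actual if c in COMMAS]
    return Counter(zip(expected_commas, actual_commas))


# Example: the model turned a FULLWIDTH COMMA into a plain COMMA.
truth = "a\uFF0Cb\uFF64c\u002Cd"
ocr = "a\u002Cb\uFF64c\u002Cd"
for (want, got), n in comma_confusions(truth, ocr).items():
    status = "ok" if want == got else "CONFUSED"
    print(f"U+{ord(want):04X} -> U+{ord(got):04X}: {n} ({status})")
```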

What's the best way to train given those types of input and the expected 
output?

Danny
On Wednesday, October 18, 2023 at 1:22:25 PM UTC+8 [email protected] wrote:

> If the space is included in the training across the board, the model might 
> not recognize the comma when it appears without a space (as in numbers: 
> 23,334). 
>
> On Wednesday, October 18, 2023 at 5:29:13 AM UTC+3 Danny wrote:
>
>> For purposes of training, I'm wondering if the box for a character should 
>> include the surrounding space. 
>>
>> In particular for the CJK "FULLWIDTH COMMA", should the box be the red or 
>> green rectangle? 
>>
>>
