[tesseract-ocr] Re: Should box include surrounding space?

Des Bw Wed, 18 Oct 2023 05:45:24 -0700

But if your only option is to edit the boxes manually, I really have no 
knowledge of that. I have never tried that route.


On Wednesday, October 18, 2023 at 3:43:51 PM UTC+3 Des Bw wrote:

> You need a large amount of data. That is all. 
> If you can collect a lot of text lines that contain all those types of 
> commas, and produce the training material using text2image (synthetic data) 
> for each font, I am pretty sure Tesseract will learn all of them with no 
> problem. 
>
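
The synthetic-data route described above could be scripted roughly as follows. This is a minimal sketch, not a tested training pipeline: the file name `commas.txt`, the output directory, the fonts directory, and the font names are all hypothetical placeholders, and it assumes the `text2image` tool from the Tesseract training tools is installed. Only the `--text`, `--outputbase`, `--font`, and `--fonts_dir` options are used.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List


def text2image_command(text_file: str, font: str, out_dir: str,
                       fonts_dir: str = "/usr/share/fonts") -> List[str]:
    """Build one text2image invocation for a single font.

    The output base name encodes the font so that each font gets its own
    image and box file.
    """
    font_tag = font.replace(" ", "_")
    outputbase = Path(out_dir) / f"{Path(text_file).stem}.{font_tag}.exp0"
    return [
        "text2image",
        f"--text={text_file}",
        f"--outputbase={outputbase}",
        f"--font={font}",
        f"--fonts_dir={fonts_dir}",
    ]


def render_all(text_file: str, fonts: List[str], out_dir: str,
               fonts_dir: str = "/usr/share/fonts") -> None:
    """Render the same training text once per font."""
    if shutil.which("text2image") is None:
        raise RuntimeError("text2image not found on PATH")
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for font in fonts:
        subprocess.run(
            text2image_command(text_file, font, out_dir, fonts_dir),
            check=True,
        )


if __name__ == "__main__":
    # Hypothetical inputs: a text file whose lines mix all three commas,
    # and two CJK fonts assumed to be installed under fonts_dir.
    render_all("commas.txt",
               ["Noto Sans CJK SC", "Noto Serif CJK SC"],
               "train_out")
```

Rendering the same text in every font is what exposes the model to both glyph positions of the fullwidth comma without any manual box editing.
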
> On Wednesday, October 18, 2023 at 12:35:01 PM UTC+3 Danny wrote:
>
>> There are a few "commas" used in CJK, which makes it complicated for me.
>>
>> *FULLWIDTH COMMA U+FF0C* (link 
>> <https://www.compart.com/en/unicode/U+FF0C>) which might have the glyph 
>> in the center of the box or in the lower left corner depending on the font:
>>
>> [image: Screenshot 2023-10-18 at 17.19.27.png] [image: commaFullWidth.jpg]
>>
>> *HALFWIDTH IDEOGRAPHIC COMMA U+FF64* (link 
>> <https://www.compart.com/en/unicode/U+FF64>) which (as far as I can 
>> tell) will always be in the bottom corner regardless of font. (used to 
>> enumerate sequences)
>> [image: Screenshot 2023-10-18 at 17.23.33.png]
>>
>> *COMMA U+002C* (link <https://www.compart.com/en/unicode/U+002C>), which 
>> isn't part of formal CJK languages but in practice is used all the time
>> [image: Screenshot 2023-10-18 at 17.21.50.png]
>>
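
To keep the three commas apart when preparing ground truth or checking OCR output, the code points can be inspected with Python's standard `unicodedata` module. This is only an illustrative sketch; the helper names `describe` and `comma_sequence` are made up for this example. Note that NFKC normalization folds U+FF0C into U+002C, so any normalization step in a data pipeline would erase the distinction.

```python
import unicodedata
from typing import List, Tuple

# The three code points discussed in this thread.
COMMAS = ("\uFF0C", "\uFF64", "\u002C")


def describe(ch: str) -> Tuple[str, str, str]:
    """Return (code point, Unicode name, East Asian Width class)."""
    return (
        f"U+{ord(ch):04X}",
        unicodedata.name(ch),
        unicodedata.east_asian_width(ch),
    )


def comma_sequence(text: str) -> List[str]:
    """List the comma code points in the order they occur in text."""
    return [f"U+{ord(ch):04X}" for ch in text if ch in COMMAS]


if __name__ == "__main__":
    for ch in COMMAS:
        print(describe(ch))

    # Compare ground truth with a (hypothetical) OCR result: the text
    # looks similar, but the comma code points differ.
    truth = "一\uFF0C二\uFF64三,四"
    ocr = "一,二\uFF64三,四"
    print(comma_sequence(truth) == comma_sequence(ocr))

    # NFKC turns the fullwidth comma into the ASCII comma.
    print(unicodedata.normalize("NFKC", "\uFF0C") == ",")
```

Comparing `comma_sequence` of the ground truth against that of the OCR output gives a quick check of whether the model preserved each comma type.
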
>> So I'd like to train to recognize the three types of commas so the OCR 
>> output matches the input images.  "FULLWIDTH COMMA" is a problem because 
>> the glyph position in the box is different depending on the font.  Hence my 
>> question "where and how big is the box?"
>>
>> [image: Screenshot 2023-10-18 at 17.28.40.png]
>>
>> In the image above, lines 1, 2, and 3 are all FULLWIDTH COMMA but line 1 
>> is a different font.  Line 4 is COMMA (U+002C) while line 5 is HALFWIDTH 
>> IDEOGRAPHIC COMMA U+FF64.
>>
>> What's the best way to train given those types of input and the expected 
>> output?
>>
>> Danny
>> On Wednesday, October 18, 2023 at 1:22:25 PM UTC+8 [email protected] 
>> wrote:
>>
>>> If the space is included in the training across the board, the model 
>>> might not recognize the comma when it appears without surrounding space 
>>> (as in numbers: 23,334). 
>>>
>>> On Wednesday, October 18, 2023 at 5:29:13 AM UTC+3 Danny wrote:
>>>
>>>> For purposes of training, I'm wondering if the box for a character 
>>>> should include the surrounding space. 
>>>>
>>>> In particular for the CJK "FULLWIDTH COMMA", should the box be the red 
>>>> or green rectangle? 
>>>>
>>>>
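
For reference, the two candidate boxes in the original question can be compared in box-file terms. The sketch below does not settle which one is better for training; it only shows how a tight box around the ink differs from a box covering the whole character cell. A Tesseract box-file line has the form `symbol left bottom right top page`, with the origin at the bottom-left of the image. The bitmap, the cell position, and all function names are invented for illustration, and the right and top edges are treated as exclusive pixel coordinates, which is an assumption of this sketch.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (left, bottom, right, top), origin bottom-left

# Toy 8x8 cell, rows listed top to bottom, '#' marks ink.
# The comma sits in the lower-left corner, as in some CJK fonts.
CELL = [
    "........",
    "........",
    "........",
    "........",
    "........",
    ".##.....",
    ".##.....",
    ".#......",
]


def tight_ink_box(cell: List[str], cell_left: int, cell_bottom: int) -> Box:
    """Smallest box that contains every ink pixel of the cell."""
    height = len(cell)
    ink = [(x, y) for y, row in enumerate(cell)
           for x, ch in enumerate(row) if ch == "#"]
    if not ink:
        raise ValueError("cell contains no ink")
    xs = [x for x, _ in ink]
    ys = [y for _, y in ink]
    left = cell_left + min(xs)
    right = cell_left + max(xs) + 1
    # Rows are indexed from the top; box files measure from the bottom.
    top = cell_bottom + height - min(ys)
    bottom = cell_bottom + height - (max(ys) + 1)
    return (left, bottom, right, top)


def full_cell_box(cell: List[str], cell_left: int, cell_bottom: int) -> Box:
    """Box covering the whole cell, surrounding space included."""
    return (cell_left, cell_bottom,
            cell_left + len(cell[0]), cell_bottom + len(cell))


def box_line(symbol: str, box: Box, page: int = 0) -> str:
    """Format one box-file line."""
    left, bottom, right, top = box
    return f"{symbol} {left} {bottom} {right} {top} {page}"


if __name__ == "__main__":
    # Hypothetical position of the cell's bottom-left corner in the image.
    print(box_line("\uFF0C", tight_ink_box(CELL, 100, 40)))
    print(box_line("\uFF0C", full_cell_box(CELL, 100, 40)))
```
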
