See https://groups.google.com/g/tesseract-ocr/c/GFHIZ8hO3c4/m/ieYUckMvBgAJ
On Friday, August 7, 2020 at 10:21:11 AM UTC+5:30 ultra wrote:
> Hello zodiac,
>
> I'm trying to train vertical Japanese, but the documentation is not great
> for vertical languages.
> Could you briefly describe the steps you took?
> Is it a line image with a text file? Is it a vertical line image or a
> horizontal line image?
>
> Thank you! :)
>
> On Monday, June 3, 2019 at 4:28:29 PM UTC-4 [email protected] wrote:
>
>> Are you using jpn_vert instead of jpn?
>> I have trained jpn_vert:
>>
>> https://github.com/zodiac3539/jpn_vert
>>
>>
>> On Mon, Jun 3, 2019 at 11:31 AM Shree Devi Kumar <[email protected]>
>> wrote:
>>
>>> Tesseract 4 has been trained on line images and hence gives better
>>> results for lines, as far as I have seen.
>>>
>>> On Sun, Jun 2, 2019 at 2:52 PM Jorge Castrillo <[email protected]>
>>> wrote:
>>>
>>>> Hi everyone. I'm making a program that uses Tesseract to get a word
>>>> from a manga with a snipping-tool-like program, and translates that
>>>> word with JMdict.
>>>> The thing is, Tesseract gives weird values for small, vertical
>>>> selections. I'm going to explain it in more detail:
>>>>
>>>>
>>>> Say I get a full horizontal line in Japanese, like the following one:
>>>>
>>>> [image: horizontal_full.jpg]
>>>> The output "元来日本語は漢文に倣い、文字を上" is perfect.
>>>>
>>>> Getting a full vertical line gives no problems either:
>>>>
>>>> [image: vertical_full.jpg]
>>>>
>>>> It gives the same correct output. Now, if I want to get only words,
>>>> there are no problems when examining horizontal text, while with
>>>> vertical text the output is almost always wrong (except when examining
>>>> a single Kanji), like this:
>>>>
>>>> [image: nih-horizontal.jpg]
>>>>
>>>>
>>>> [image: nih-vertical.jpg]
>>>>
>>>>
>>>> The first one returns 日本語 while the second one returns 髑升田.
>>>> They are both from the same file, same size, same font, yet the
>>>> results vary greatly.
>>>>
>>>>
>>>> Another example, this time from a manga:
>>>>
>>>> [image: ej2full.jpg]
>>>>
>>>> The output is 今日の勝敗よりも, again correct.
>>>> But going word by word, we start to get errors:
>>>>
>>>> [image: eje2-word1.jpg]
>>>> Output 由」〉
>>>>
>>>> and
>>>>
>>>> [image: ej2-word.jpg]
>>>> Output 健雛
>>>>
>>>> Why is it that it can examine the full line without a problem, but has
>>>> so much trouble getting vertical words? I am using psm 8 for words, but
>>>> it only seems to work with horizontal ones, and I can't get my head
>>>> around it.
>>>> I've been trying to find a solution to this all day, but without
>>>> success. I'm not an expert programmer by any means, this is more of a
>>>> college project, but any insight would be really, really appreciated.
>>>> Thank you for reading.
>>>>
>>>> --
>>>> You received this message because you are subscribed to the Google
>>>> Groups "tesseract-ocr" group.
>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>> an email to [email protected].
>>>> To post to this group, send email to [email protected].
>>>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>>>> To view this discussion on the web visit
>>>> https://groups.google.com/d/msgid/tesseract-ocr/71b34e0f-5713-42d3-9ba0-4926291758cb%40googlegroups.com
>>>> For more options, visit https://groups.google.com/d/optout.
>>>>
>>>
>>>
>>> --
>>>
>>> ____________________________________________________________
>>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
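
For anyone landing on this thread later, here is a minimal sketch of how the two suggestions quoted above (use jpn_vert rather than jpn, and feed Tesseract something it treats as vertical text rather than a single word) might be wired up from Python. It is not taken from any of the posters' code. The helper names build_tesseract_cmd and ocr_crop are made up for illustration, and choosing --psm 5 (a single uniform block of vertically aligned text) instead of --psm 8 for short vertical crops is an assumption that has not been checked against the images in this thread. It also assumes the tesseract binary is on PATH and that jpn_vert.traineddata is installed.

```python
import subprocess


def build_tesseract_cmd(image_path, vertical=True, psm=None):
    """Build the argument list for the tesseract CLI (hypothetical helper).

    Assumption: vertical crops use the jpn_vert model with --psm 5
    (single uniform block of vertically aligned text); horizontal crops
    use jpn with --psm 7 (single text line). Pass psm explicitly to
    compare other modes, e.g. psm=8 as in the original question.
    """
    lang = "jpn_vert" if vertical else "jpn"
    if psm is None:
        psm = 5 if vertical else 7
    # "stdout" as the output base makes tesseract print the recognized text.
    return ["tesseract", image_path, "stdout", "-l", lang, "--psm", str(psm)]


def ocr_crop(image_path, vertical=True, psm=None):
    """Run tesseract on one cropped image and return the recognized text."""
    cmd = build_tesseract_cmd(image_path, vertical=vertical, psm=psm)
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout.strip()


# Example call (needs tesseract and a real image file, so it is left commented):
# print(ocr_crop("nih-vertical.jpg"))
```

Running the same crop once with the default and once with psm=8 is a quick way to see whether the segmentation mode or the model is what changes the result.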

