Tesseract 4 was trained on images of whole text lines, so as far as I have seen it gives better results on full lines than on isolated words.
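As a rough sketch of what that could mean for your program (I have not tried this on your images): instead of psm 8 (single word), you could pass each selection to tesseract as a line or a vertical block. The helper names below are made up for illustration, and I am assuming the `jpn` and `jpn_vert` traineddata files are installed and that `tesseract` is on your PATH. Whether this actually fixes the short vertical selections is something you would have to test.

```python
import subprocess


def build_tesseract_command(image_path, vertical):
    """Build a tesseract CLI call that treats the crop as a line, not a word.

    Hypothetical helper, not part of tesseract itself.
    """
    # Assumption: jpn.traineddata and jpn_vert.traineddata are installed.
    lang = "jpn_vert" if vertical else "jpn"
    # --psm 5: a single uniform block of vertically aligned text
    # --psm 7: a single text line
    psm = "5" if vertical else "7"
    # "stdout" as the output base makes tesseract print the recognized text.
    return ["tesseract", image_path, "stdout", "-l", lang, "--psm", psm]


def recognize(image_path, vertical):
    """Run tesseract on one cropped selection and return the text."""
    result = subprocess.run(
        build_tesseract_command(image_path, vertical),
        capture_output=True,
        encoding="utf-8",
        check=True,  # raise CalledProcessError if tesseract fails
    )
    return result.stdout.strip()
```

For example, `recognize("nih-vertical.jpg", vertical=True)` would run tesseract with `-l jpn_vert --psm 5` on your vertical crop. If single words still come out wrong, selecting a bit more of the line around the word may be worth trying, since full lines already work for you.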

On Sun, Jun 2, 2019 at 2:52 PM Jorge Castrillo <jorgemcastri...@gmail.com> wrote:

> Hi everyone. I'm making a program that uses tesseract to get a word
> from a manga with a snipping-tool like program, and translates that word
> with JMdict.
> The thing is tesseract gives weird values for vertical, small selections.
> I'm going to explain it in more detail:
>
> Say I get a full horizontal line in Japanese, like the following one:
>
> [image: horizontal_full.jpg]
> The output "元来日本語は漢文に倣い、文字を上" is perfect
>
> Getting a full vertical line gives no problems either:
>
> [image: vertical_full.jpg]
>
> Gives the same correct output. Now if I want to get only words, when
> examining horizontal text there are no problems, while with the vertical
> text the output is almost always (except when examining a Kanji alone)
> wrong, like this:
>
> [image: nih-horizontal.jpg]
>
> [image: nih-vertical.jpg]
>
> The first one returns 日本語 while the second one returns 髑升田.
> They are both from the same file, same size, same font, yet the results
> vary greatly.
>
> Another example, this time from a manga:
>
> [image: ej2full.jpg]
>
> The output is 今日の勝敗よりも, again, correct.
> But going word by word we start to have errors:
>
> [image: eje2-word1.jpg]
> Output 由」〉
>
> and
>
> [image: ej2-word.jpg]
> Output 健雛
>
> Why is it that it can examine the full line without problem, but have so
> much trouble getting vertical words? I am using psm 8 for words, but it
> only seems to work with horizontal ones, and I can't get my head around it.
> I've been trying to find a solution to this all day, but without success.
> I'm not an expert programmer by any means, this is more of a college
> project, but any insight would be really, really appreciated. Thank you for
> reading.
>
> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to tesseract-ocr+unsubscr...@googlegroups.com.
> To post to this group, send email to tesseract-ocr@googlegroups.com.
> Visit this group at https://groups.google.com/group/tesseract-ocr.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/tesseract-ocr/71b34e0f-5713-42d3-9ba0-4926291758cb%40googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.

--
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com