Try looking at the Leptonica sample programs that deal with column splitting to see whether you can preprocess the image better before passing it to Tesseract.
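As a rough illustration of the "columnar histogram" idea from the question, here is a minimal sketch in plain Python. It is not taken from Leptonica or Tesseract. The function names, the list-of-rows binary image (1 = ink), and both thresholds (`max_ink_fraction`, `rule_fraction`) are assumptions made for the example. It counts ink per pixel column, treats nearly full columns as a vertical rule to be ignored, and returns the widest interior run of empty columns as the gutter.

```python
def column_profile(image):
    """Count ink pixels (value 1) in each pixel column of a binary image."""
    if not image:
        return []
    width = len(image[0])
    return [sum(row[x] for row in image) for x in range(width)]


def find_gutter(image, max_ink_fraction=0.05, rule_fraction=0.8):
    """Return (start, end) of the widest interior gutter, or None.

    `end` is exclusive. Both thresholds are illustrative guesses:
    a column counts as blank if it is almost empty, or if it is almost
    completely inked (assumed to be a vertical rule, not text).
    """
    height = len(image)
    profile = column_profile(image)
    width = len(profile)
    if height == 0 or width == 0:
        return None

    def is_blank(x):
        fraction = profile[x] / height
        return fraction <= max_ink_fraction or fraction >= rule_fraction

    best = None
    run_start = None
    for x in range(width + 1):
        blank = x < width and is_blank(x)
        if blank and run_start is None:
            run_start = x
        elif not blank and run_start is not None:
            # Runs touching the page edge are margins, not gutters.
            touches_edge = run_start == 0 or x == width
            if not touches_edge and (
                best is None or x - run_start > best[1] - best[0]
            ):
                best = (run_start, x)
            run_start = None
    return best
```

If a gutter is found, one option is to crop the page at its midpoint and run Tesseract on each half separately. A faint or broken line may fall below `rule_fraction`, so the thresholds would need tuning on real scans.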
On Wed 11 Apr, 2018, 11:46 AM Ewan Mellor, <ewanmel...@gmail.com> wrote:

> Hi,
>
> I am using Tesseract 4 (git 10f4998a) to process a file with two columns.
> A snippet of the image is shown below. The problem is that there is a
> fuzzy line between the two columns, and the column detector has got
> confused. I've ended up with one block covering the first column up to
> "The" on the second line, but then a block covering both columns with the
> "patient has ..." all the way across to "history of low".
>
> I've looked in the debug views, and it looks to me like the line removal
> hasn't managed to remove that fuzzy line down the middle. The "good" is
> then close enough that the column finder is deciding to merge the two
> blocks on that line.
>
> Looking at the code in linefind.cpp and colfind.cpp, I see lots of
> constants for various thresholds, but I don't see any configurable ones,
> and I'm not sure which way to go now. Would it be better to work on the
> line detector in linefind.cpp and try and get rid of that vertical line?
> Or would I be better to run a columnar histogram and try and do column
> splitting myself? Or should I ignore the fact that the line hasn't been
> removed, and concentrate on tightening up the column finder so that it's
> able to separate these two columns correctly? It seems to me that there's
> enough of a gap there that it ought to be able to separate the columns (it
> does a pretty good job on the rest of the document, so it can't be far off).
>
> Any recommendations would be appreciated.
>
> Thanks,
>
> Ewan.
>
> <https://lh3.googleusercontent.com/-mrxB3T8S4fM/Ws1h25mfleI/AAAAAAAACoc/fJi8OkO6wswexnYDZU2uoofSRBCYmPiVwCLcBGAs/s1600/Screen%2BShot%2B2018-04-10%2Bat%2B6.12.48%2BPM.png>
>
> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to tesseract-ocr+unsubscr...@googlegroups.com.
> To post to this group, send email to firstname.lastname@example.org.
> Visit this group at https://groups.google.com/group/tesseract-ocr.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/tesseract-ocr/bdee5651-c305-4bbb-a14c-ccd5ba5cd7e2%40googlegroups.com
> <https://groups.google.com/d/msgid/tesseract-ocr/bdee5651-c305-4bbb-a14c-ccd5ba5cd7e2%40googlegroups.com?utm_medium=email&utm_source=footer>.
> For more options, visit https://groups.google.com/d/optout.