Try psm 6 (assume a single uniform block of text), and also 11 and 12 (the sparse-text modes). See https://github.com/tesseract-ocr/tesseract/issues/434
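
To compare the three modes quickly, something like the following could be used. It is only a sketch: the image name `index_page.png` and the helper names `build_command` and `run_modes` are made up for illustration, and it assumes a `tesseract` binary on the PATH that accepts `--psm` (older 3.x builds spell the flag `-psm`, so pass `psm_flag="-psm"` there).

```python
import subprocess

# Page segmentation modes suggested in this reply.
PSM_MODES = (6, 11, 12)


def build_command(image_path, psm, psm_flag="--psm"):
    """Return the argv list that asks tesseract to print OCR text to stdout."""
    # "stdout" as the output base makes tesseract write the text to
    # standard output instead of creating a .txt file.
    return ["tesseract", image_path, "stdout", psm_flag, str(psm)]


def run_modes(image_path, modes=PSM_MODES, psm_flag="--psm"):
    """Run tesseract once per mode and return {psm: recognized text}."""
    results = {}
    for psm in modes:
        proc = subprocess.run(
            build_command(image_path, psm, psm_flag),
            capture_output=True,
            text=True,
            check=False,  # keep going even if one mode fails
        )
        results[psm] = proc.stdout
    return results


# Hypothetical usage:
# for psm, text in run_modes("index_page.png").items():
#     print(f"--- psm {psm} ---")
#     print(text)
```

Comparing the outputs side by side should show which mode, if any, keeps the columns apart for your layout.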

On 13 Oct 2016 1:13 p.m., "fuzzy7k" <[email protected]> wrote:

> I tried psm 0-3
>
> On Thursday, October 13, 2016 at 1:46:45 AM UTC-4, shree wrote:
>>
>> Which page segmentation mode (psm) did you try?
>>
>> On 12 Oct 2016 11:21 p.m., "fuzzy7k" <[email protected]> wrote:
>>
>>> I have scanned some index pages that I would like to ocr for rapid
>>> searching. I am using tesseract from the command line. The problem is that
>>> tesseract ignores the whitespace between columns and merges everything
>>> together, essentially fragmenting the contents. Using some debug output I
>>> see that no "columns" are detected. Probably more important is that three
>>> "blocks" are detected, one around the first and last line, and one
>>> encompassing everything in between. Is there a way to train block
>>> detection, or some parameters that I can tweak to optimize this?
>>>
>>> I have attached the image merely as an abstract representation of the
>>> text layout to show the types of columns I am dealing with. Ideally, it
>>> would also be nice to know if tab stops can be trained and used to oneline
>>> each individual topic, which I could do postprocess if I could get tabstops
>>> printed.
>>>
>>> --
>>> You received this message because you are subscribed to the Google
>>> Groups "tesseract-ocr" group.
>>> To unsubscribe from this group and stop receiving emails from it, send
>>> an email to [email protected].
>>> To post to this group, send email to [email protected].
>>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>>> To view this discussion on the web visit
>>> https://groups.google.com/d/msgid/tesseract-ocr/5b4800f9-cead-4959-9260-52e98ee596b7%40googlegroups.com.
>>> For more options, visit https://groups.google.com/d/optout.

