negative

On Friday, October 14, 2016 at 3:29:53 AM UTC-4, shree wrote:
>
> You can also experiment with hocr and tsv output modes to see if they help.
>
> On 14 Oct 2016 2:53 a.m., "fuzzy7k" <kva...@gmail.com> wrote:
>
>> Going back to psm 3, I did find that textord_tabfind_find_tables 0
>> helped, in that it draws only one box around the "block" of text, instead
>> of the three that I was first getting. This is obviously the same as psm 6,
>> but psm 6 should not run column detection, which is something that I want
>> unless I can get tesseract to draw "blocks" vertically around the
>> individual columns.
>>
>> On Thursday, October 13, 2016 at 8:30:05 PM UTC-4, fuzzy7k wrote:
>>>
>>> 6 gives the exact same results as 3 (i.e. no column separation). 11 & 12
>>> are essentially the same in that they pull text from left to right, but
>>> with three times as many newlines.
>>>
>>> On Thursday, October 13, 2016 at 8:21:09 AM UTC-4, shree wrote:
>>>>
>>>> Try psm 6, also 11, 12
>>>>
>>>> https://github.com/tesseract-ocr/tesseract/issues/434
>>>>
>>>> On 13 Oct 2016 1:13 p.m., "fuzzy7k" <kva...@gmail.com> wrote:
>>>>
>>>>> I tried psm 0-3
>>>>>
>>>>> On Thursday, October 13, 2016 at 1:46:45 AM UTC-4, shree wrote:
>>>>>>
>>>>>> Which page segmentation mode (psm) did you try?
>>>>>>
>>>>>> On 12 Oct 2016 11:21 p.m., "fuzzy7k" <kva...@gmail.com> wrote:
>>>>>>
>>>>>>> I have scanned some index pages that I would like to ocr for rapid
>>>>>>> searching. I am using tesseract from the command line. The problem is
>>>>>>> that tesseract ignores the whitespace between columns and merges
>>>>>>> everything together, essentially fragmenting the contents. Using some
>>>>>>> debug output I see that no "columns" are detected. Probably more
>>>>>>> important is that three "blocks" are detected, one around the first
>>>>>>> and last line, and one encompassing everything in between.
>>>>>>> Is there a way to train block detection, or some parameters that I
>>>>>>> can tweak to optimize this?
>>>>>>>
>>>>>>> I have attached the image merely as an abstract representation of
>>>>>>> the text layout to show the types of columns I am dealing with.
>>>>>>> Ideally, it would also be nice to know if tab stops can be trained
>>>>>>> and used to oneline each individual topic, which I could do
>>>>>>> postprocess if I could get tabstops printed.
>>>>>>>
>>>>>>> --
>>>>>>> You received this message because you are subscribed to the Google
>>>>>>> Groups "tesseract-ocr" group.
>>>>>>> To unsubscribe from this group and stop receiving emails from it,
>>>>>>> send an email to tesseract-oc...@googlegroups.com.
>>>>>>> To post to this group, send email to tesser...@googlegroups.com.
>>>>>>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>>>>>>> To view this discussion on the web visit
>>>>>>> https://groups.google.com/d/msgid/tesseract-ocr/5b4800f9-cead-4959-9260-52e98ee596b7%40googlegroups.com
>>>>>>> For more options, visit https://groups.google.com/d/optout.
-- You received this message because you are subscribed to the Google Groups "tesseract-ocr" group. To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-ocr+unsubscr...@googlegroups.com. To post to this group, send email to tesseract-ocr@googlegroups.com. Visit this group at https://groups.google.com/group/tesseract-ocr. To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/31bc93d0-863b-4d35-b608-9dba08726d53%40googlegroups.com. For more options, visit https://groups.google.com/d/optout.
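
One way to act on the tsv suggestion in this thread is to let tesseract report word boxes and do the column split in a post-processing step. The sketch below is only an illustration, not something tested on the scanned index pages discussed here. It assumes a tesseract build that supports the tsv config and an invocation along the lines of `tesseract page.png page --psm 3 -c textord_tabfind_find_tables=0 tsv` (older builds spell the option `-psm`; the file names are placeholders). The function names, the `min_gap` gutter width, and the `line_tol` tolerance are all made up for this example and would need tuning per scan.

```python
import csv
import io

def read_words(tsv_text):
    """Parse tesseract TSV text and keep only word rows (level 5) with text."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t",
                            quoting=csv.QUOTE_NONE)
    words = []
    for row in reader:
        text = (row.get("text") or "").strip()
        if row["level"] != "5" or not text:
            continue
        left = int(row["left"])
        words.append({"text": text, "left": left,
                      "right": left + int(row["width"]),
                      "top": int(row["top"])})
    return words

def find_column_ranges(words, min_gap=40):
    """Merge word x-spans; a horizontal gap >= min_gap starts a new column.

    min_gap is an assumed gutter width in pixels, not a tesseract parameter.
    """
    columns = []
    for left, right in sorted((w["left"], w["right"]) for w in words):
        if columns and left - columns[-1][1] < min_gap:
            columns[-1][1] = max(columns[-1][1], right)
        else:
            columns.append([left, right])
    return [tuple(c) for c in columns]

def text_by_column(words, min_gap=40, line_tol=10):
    """Return one text string per column, lines rebuilt from word tops."""
    result = []
    for c_left, c_right in find_column_ranges(words, min_gap):
        col = [w for w in words if c_left <= w["left"] <= c_right]
        col.sort(key=lambda w: (w["top"], w["left"]))
        lines = []
        for w in col:
            # Words whose tops are within line_tol pixels share a line.
            if lines and abs(w["top"] - lines[-1][0]) <= line_tol:
                lines[-1][1].append(w)
            else:
                lines.append((w["top"], [w]))
        result.append("\n".join(
            " ".join(x["text"] for x in sorted(ws, key=lambda x: x["left"]))
            for _, ws in lines))
    return result

# Synthetic two-column sample in tesseract's TSV column layout.
HEADER = ["level", "page_num", "block_num", "par_num", "line_num",
          "word_num", "left", "top", "width", "height", "conf", "text"]
ROWS = [
    [4, 1, 1, 1, 1, 0, 10, 100, 400, 20, -1, ""],   # line row, skipped
    [5, 1, 1, 1, 1, 1, 10, 100, 60, 20, 95, "Apples"],
    [5, 1, 1, 1, 1, 2, 80, 100, 30, 20, 95, "12"],
    [5, 1, 1, 1, 1, 3, 300, 100, 70, 20, 95, "Bananas"],
    [5, 1, 1, 1, 1, 4, 380, 100, 30, 20, 95, "34"],
    [5, 1, 1, 1, 2, 1, 10, 130, 70, 20, 95, "Cherries"],
    [5, 1, 1, 1, 2, 2, 90, 130, 30, 20, 95, "56"],
    [5, 1, 1, 1, 2, 3, 300, 130, 50, 20, 95, "Dates"],
    [5, 1, 1, 1, 2, 4, 360, 130, 30, 20, 95, "78"],
]
SAMPLE_TSV = "\n".join("\t".join(str(v) for v in row)
                       for row in [HEADER] + ROWS)

if __name__ == "__main__":
    for column_text in text_by_column(read_words(SAMPLE_TSV)):
        print(column_text)
        print("---")
```

On the synthetic sample, the first column comes out as "Apples 12" and "Cherries 56" and the second as "Bananas 34" and "Dates 78", even though tesseract reported each pair on a single merged line. The approach relies on an unbroken vertical strip of whitespace between columns, so a heading that spans the full page width would have to be filtered out first.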