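You can also experiment with the hocr and tsv output modes to see if they
help. Both report a bounding box for every word, which may let you recover
the columns in post-processing.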
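
As a rough illustration of what the tsv output could be used for, here is a minimal Python sketch that splits each recognized line into column cells wherever the horizontal gap between two neighbouring words is large. It assumes the standard tsv header (level, page_num, block_num, par_num, line_num, word_num, left, top, width, height, conf, text) and that word rows have level 5. The function name split_columns and the min_gap threshold of 40 pixels are made up for this example and would need tuning for a real scan.

```python
import csv
import io
from collections import defaultdict


def split_columns(tsv_text, min_gap=40):
    """Split each OCR line into cells at wide horizontal gaps.

    tsv_text: contents of a Tesseract .tsv output file, header included.
    min_gap:  hypothetical threshold in pixels; a gap at least this wide
              between two words starts a new cell.
    Returns one list of cell strings per text line.
    """
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )

    # Collect word boxes, keyed by the line they belong to.
    lines = defaultdict(list)
    for row in reader:
        text = (row.get("text") or "").strip()
        if row.get("level") != "5" or not text:
            continue  # skip page/block/paragraph/line rows and empty words
        key = (
            int(row["page_num"]),
            int(row["block_num"]),
            int(row["par_num"]),
            int(row["line_num"]),
        )
        lines[key].append((int(row["left"]), int(row["width"]), text))

    result = []
    for key in sorted(lines):
        words = sorted(lines[key])  # left-to-right order
        cells = [[words[0][2]]]
        prev_right = words[0][0] + words[0][1]
        for left, width, text in words[1:]:
            if left - prev_right >= min_gap:
                cells.append([text])    # wide gap: start a new cell
            else:
                cells[-1].append(text)  # narrow gap: same cell
            prev_right = left + width
        result.append([" ".join(cell) for cell in cells])
    return result
```

Reading the .tsv file into a string and passing it to split_columns gives one list of cells per line, which can then be regrouped by column index. This only works if the words of one visual row end up in the same Tesseract line, so it is a starting point rather than a fix for the block detection itself.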

On 14 Oct 2016 2:53 a.m., "fuzzy7k" <kvan...@gmail.com> wrote:

> Going back to psm 3, I did find that setting textord_tabfind_find_tables
> to 0 helped, in that tesseract now draws only one box around the "block"
> of text instead of the three that I was first getting. That gives the same
> result as psm 6, but psm 6 should not run column detection, and column
> detection is something that I want unless I can get tesseract to draw
> "blocks" vertically around the individual columns.
>
> On Thursday, October 13, 2016 at 8:30:05 PM UTC-4, fuzzy7k wrote:
>>
>> 6 gives the exact same results as 3 (i.e. no column separation). 11 & 12
>> are essentially the same in that they pull text from left to right, but
>> with three times as many newlines.
>>
>> On Thursday, October 13, 2016 at 8:21:09 AM UTC-4, shree wrote:
>>>
>>> Try psm 6, also 11, 12
>>>
>>> https://github.com/tesseract-ocr/tesseract/issues/434
>>>
>>> On 13 Oct 2016 1:13 p.m., "fuzzy7k" <kva...@gmail.com> wrote:
>>>
>>>> I tried psm 0-3
>>>>
>>>> On Thursday, October 13, 2016 at 1:46:45 AM UTC-4, shree wrote:
>>>>>
>>>>> Which page segmentation mode (psm) did you try?
>>>>>
>>>>> On 12 Oct 2016 11:21 p.m., "fuzzy7k" <kva...@gmail.com> wrote:
>>>>>
>>>>>> I have scanned some index pages that I would like to OCR for rapid
>>>>>> searching. I am using tesseract from the command line. The problem is
>>>>>> that tesseract ignores the whitespace between columns and merges
>>>>>> everything together, essentially fragmenting the contents. Using some
>>>>>> debug output I see that no "columns" are detected. Probably more
>>>>>> important is that three "blocks" are detected: one around the first
>>>>>> line, one around the last line, and one encompassing everything in
>>>>>> between. Is there a way to train block detection, or some parameters
>>>>>> that I can tweak to optimize this?
>>>>>>
>>>>>> I have attached the image merely as an abstract representation of the
>>>>>> text layout to show the types of columns I am dealing with. Ideally, it
>>>>>> would also be nice to know if tab stops can be trained and used to put
>>>>>> each individual topic on a single line, which I could do in
>>>>>> post-processing if I could get the tab stops printed.
>>>>>>
>>>>>> --
>>>>>> You received this message because you are subscribed to the Google
>>>>>> Groups "tesseract-ocr" group.
>>>>>> To unsubscribe from this group and stop receiving emails from it,
>>>>>> send an email to tesseract-oc...@googlegroups.com.
>>>>>> To post to this group, send email to tesser...@googlegroups.com.
>>>>>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>>>>>> To view this discussion on the web visit
>>>>>> https://groups.google.com/d/msgid/tesseract-ocr/5b4800f9-cead-4959-9260-52e98ee596b7%40googlegroups.com.
>>>>>> For more options, visit https://groups.google.com/d/optout.
>>>>>>
