Re: [tesseract-ocr] Failure to recognize columns

2016-10-14 Thread ShreeDevi Kumar
You can also experiment with hocr and tsv output modes to see if they help.
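
A minimal sketch of what the TSV output offers, assuming a Tesseract build that supports the tsv config and the usual 12-column layout (level, page_num, block_num, par_num, line_num, word_num, left, top, width, height, conf, text). The sample string and the function name are made up for illustration; this is not real output from the scan discussed in this thread.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample in Tesseract's TSV layout. Real data would come from a
# run such as `tesseract page.png out tsv`, which writes out.tsv.
SAMPLE_TSV = (
    "level\tpage_num\tblock_num\tpar_num\tline_num\tword_num"
    "\tleft\ttop\twidth\theight\tconf\ttext\n"
    "5\t1\t1\t1\t1\t1\t40\t100\t60\t20\t95\tApples\n"
    "5\t1\t1\t1\t1\t2\t110\t100\t20\t20\t93\t12\n"
    "5\t1\t1\t1\t1\t3\t400\t100\t70\t20\t96\tBananas\n"
    "5\t1\t1\t1\t1\t4\t480\t100\t20\t20\t91\t15\n"
)


def words_by_line(tsv_text):
    """Group word-level rows (level 5) by (block, paragraph, line)."""
    lines = defaultdict(list)
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    for row in reader:
        text = (row["text"] or "").strip()
        if row["level"] != "5" or not text:
            continue  # skip page/block/paragraph/line rows and empty words
        key = (int(row["block_num"]), int(row["par_num"]), int(row["line_num"]))
        lines[key].append((int(row["left"]), int(row["width"]), text))
    return dict(lines)
```

In this sample the words on one line end at x=130 and resume at x=400; a gap like that is what a post-processing step could use to tell columns apart even when Tesseract reports a single block.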

-- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to tesseract-ocr+unsubscr...@googlegroups.com.
To post to this group, send email to tesseract-ocr@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit 

Re: [tesseract-ocr] Failure to recognize columns

2016-10-13 Thread fuzzy7k
Going back to psm 3, I did find that setting textord_tabfind_find_tables to 0
helped, in that it draws only one box around the "block" of text instead of
the three that I was first getting. This is obviously the same result as
psm 6, but psm 6 should not run column detection, which is something that I
want unless I can get tesseract to draw "blocks" vertically around the
individual columns.
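
For reference, a run like the one described above could be scripted roughly as follows. This is a sketch, not the poster's actual command: the file names and the helper name are placeholders, and the --psm spelling is an assumption that fits newer Tesseract releases (older 3.x builds used -psm). Setting a config variable with -c VAR=VALUE is the standard command-line form.

```python
def tesseract_command(image, output_base, psm=3, find_tables=False,
                      psm_flag="--psm"):
    """Build an argument list for the tesseract CLI (does not run it).

    psm_flag is a parameter because older builds expect "-psm".
    """
    return [
        "tesseract", image, output_base,
        psm_flag, str(psm),
        # Table detection off corresponds to textord_tabfind_find_tables=0.
        "-c", f"textord_tabfind_find_tables={int(find_tables)}",
    ]


# Example: pass the list to subprocess.run(cmd, check=True) to execute it,
# which requires tesseract to be installed and on PATH.
cmd = tesseract_command("index_page.png", "index_page")
```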


Re: [tesseract-ocr] Failure to recognize columns

2016-10-13 Thread fuzzy7k
Mode 6 gives the exact same results as mode 3 (i.e. no column separation).
Modes 11 and 12 are essentially the same as each other in that they pull text
from left to right, but with three times as many newlines.


Re: [tesseract-ocr] Failure to recognize columns

2016-10-13 Thread ShreeDevi Kumar
Try psm 6, also 11, 12

https://github.com/tesseract-ocr/tesseract/issues/434
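
For readers unfamiliar with the numbers, the modes mentioned in this thread correspond roughly to the descriptions below, paraphrased from the tesseract help text. The exact wording and the set of available modes vary by version, so treat this lookup table (and the helper name) as an illustrative sketch rather than an authoritative list.

```python
# Page segmentation modes discussed in this thread only; other modes exist.
PSM_MODES = {
    0: "Orientation and script detection (OSD) only",
    1: "Automatic page segmentation with OSD",
    2: "Automatic page segmentation, but no OSD or OCR",
    3: "Fully automatic page segmentation, but no OSD (default)",
    6: "Assume a single uniform block of text",
    11: "Sparse text: find as much text as possible in no particular order",
    12: "Sparse text with OSD",
}


def describe_psm(mode):
    """Return a short description for a psm number used in this thread."""
    return PSM_MODES.get(mode, "not covered in this thread")
```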


Re: [tesseract-ocr] Failure to recognize columns

2016-10-13 Thread fuzzy7k
I tried psm 0-3


Re: [tesseract-ocr] Failure to recognize columns

2016-10-12 Thread ShreeDevi Kumar
Which page segmentation mode (psm) did you try?


[tesseract-ocr] Failure to recognize columns

2016-10-12 Thread fuzzy7k
I have scanned some index pages that I would like to OCR for rapid
searching. I am using tesseract from the command line. The problem is that
tesseract ignores the whitespace between columns and merges everything
together, essentially fragmenting the contents. Using some debug output I
see that no "columns" are detected. Probably more important is that three
"blocks" are detected: one around the first line, one around the last line,
and one encompassing everything in between. Is there a way to train block
detection, or some parameters that I can tweak to optimize this?

I have attached the image merely as an abstract representation of the text
layout to show the types of columns I am dealing with. Ideally, it would
also be nice to know if tab stops can be trained and used to put each
individual topic on one line, which I could do in post-processing if I could
get the tab stops printed.
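
The post-processing idea could be sketched as follows, assuming word bounding boxes are already available from some Tesseract output that includes coordinates. The function name, the min_gap threshold, and the box format are all hypothetical, and this is not something Tesseract does itself: the sketch looks for vertical strips that no word overlaps and treats strips at least min_gap pixels wide as column gutters.

```python
def split_columns(boxes, min_gap=40):
    """Split word boxes into columns at wide horizontal gaps.

    boxes: list of (left, top, width, text) tuples in pixel coordinates.
    Returns a list of columns, each a list of boxes in reading order.
    """
    if not boxes:
        return []
    # Merge the horizontal extents of all words into covered intervals.
    spans = sorted((left, left + width) for left, _, width, _ in boxes)
    merged = [list(spans[0])]
    for start, end in spans[1:]:
        if start - merged[-1][1] >= min_gap:
            merged.append([start, end])  # wide empty strip: new column
        else:
            merged[-1][1] = max(merged[-1][1], end)
    # Assign each word to the interval that contains its left edge.
    columns = [[] for _ in merged]
    for box in boxes:
        for i, (start, end) in enumerate(merged):
            if start <= box[0] <= end:
                columns[i].append(box)
                break
    # Within a column, read top to bottom, then left to right.
    return [sorted(col, key=lambda b: (b[1], b[0])) for col in columns]
```

A title or footer line that spans the full page width would bridge the gutter, so lines like the first and last ones described above would presumably need to be set aside before calling this.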
