Re: [tesseract-ocr] Re: Page layout analysis module

Age Bosma Tue, 08 Mar 2016 08:25:02 -0800

Hi Zdenko,

Man, would I have liked getting that hint 5 years ago... :-/
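
For anyone landing on this thread later, here is a minimal sketch of how the TSV output Zdenko mentions below could be reduced to a plain list of block dimensions. It assumes a build that ships the `tsv` config (3.05 / master, as he says), invoked with something like `tesseract sample2.jpg output -l eng tsv`, and it assumes the column layout shown in the sample. The sample rows and the `block_boxes` helper are made up for illustration and are not part of tesseract.

```python
import csv
import io

# Sample in the TSV column layout (assumed, as of 3.05); values are invented.
SAMPLE_TSV = (
    "level\tpage_num\tblock_num\tpar_num\tline_num\tword_num\t"
    "left\ttop\twidth\theight\tconf\ttext\n"
    "1\t1\t0\t0\t0\t0\t0\t0\t800\t600\t-1\t\n"
    "2\t1\t1\t0\t0\t0\t10\t20\t250\t20\t-1\t\n"
    "5\t1\t1\t1\t1\t1\t10\t20\t100\t20\t91\tSome\n"
    "5\t1\t1\t1\t1\t2\t120\t20\t140\t20\t93\ttext\n"
    "2\t1\t2\t0\t0\t0\t350\t400\t600\t410\t-1\t\n"
)


def block_boxes(tsv_text):
    """Return (left, top, width, height) for every block-level row."""
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    return [
        (int(r["left"]), int(r["top"]), int(r["width"]), int(r["height"]))
        for r in reader
        if r["level"] == "2"  # level 2 rows describe blocks
    ]


for box in block_boxes(SAMPLE_TSV):
    print("[block:%d,%d,%d,%d]" % box)
```

In practice the text would be read from the generated `output.tsv` file instead of the inline sample.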


Best regards,

Age Bosma


On Tuesday, 8 March 2016 16:56:36 UTC+1, zdenop wrote:
>
> IMO it is: in the hOCR (XML) output, or in the TSV output (in the master branch, a.k.a. 3.05).
>
> Zdenko
>
> On Tue, Mar 8, 2016 at 3:14 PM, Age Bosma <[email protected]> wrote:
>
>> Hi Teng,
>>
>> The options I mention aren't available in tesseract. I listed them as 
>> suggestions for extending tesseract. They haven't been implemented as far 
>> as I know.
>>
>> Best regards,
>>
>> Age
>>
>>
>>
>> On Monday, 7 March 2016 09:56:40 UTC+1, Teng Long wrote:
>>>
>>>
>>> Hi Age, I'm a newbie in OCR.
>>> You mentioned 3 options for using tesseract;
>>> could you please tell me how to use these 3 options?
>>>
>>> Any command is appreciated.
>>> Like:
>>>        tesseract sample2.jpg output -l eng -psm 3
>>>
>>> Thank you !
>>>
>>> On Monday, June 20, 2011 at 8:19:03 PM UTC+8, Age Bosma wrote:
>>>>
>>>> Thank you for your reply.
>>>>
>>>> Nice to learn that it is possible programming-wise. I should, however,
>>>> have been more clear that I was referring to command-line functionality.
>>>>
>>>> Would it be an idea to extend the tesseract command-line tools so that
>>>> the output contains block dimensions?
>>>>
>>>> So one option to output just the text (current behaviour):
>>>> --------------------------------
>>>> Some text
>>>> And yet again some other text
>>>> --------------------------------
>>>>
>>>> A second option to output the text marked with its block dimensions:
>>>> --------------------------------
>>>> [block:10,20,250,20]
>>>> Some text
>>>> [block:350,400,600,410]
>>>> And yet again some other text
>>>> --------------------------------
>>>>
>>>> A third option to output just all blocks:
>>>> --------------------------------
>>>> [block:10,20,250,20]
>>>> [block:350,400,600,410]
>>>> --------------------------------
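
The three proposed outputs above can be sketched as a single formatting function. This is only an illustration of the suggestion, not an existing tesseract feature: the `render` function and its mode names are hypothetical, and the four numbers of each block are passed through as-is because the proposal does not say how they should be interpreted.

```python
def render(blocks, mode="text"):
    """Format (dims, text) pairs in one of the three proposed styles.

    blocks: list of ((a, b, c, d), text) pairs
    mode:   "text" (text only), "both" (marker + text), "blocks" (markers only)
    """
    lines = []
    for dims, text in blocks:
        if mode in ("both", "blocks"):
            lines.append("[block:%d,%d,%d,%d]" % dims)
        if mode in ("text", "both"):
            lines.append(text)
    return "\n".join(lines)


blocks = [
    ((10, 20, 250, 20), "Some text"),
    ((350, 400, 600, 410), "And yet again some other text"),
]
for mode in ("text", "both", "blocks"):
    print(render(blocks, mode))
    print("--------------------------------")
```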
>>>>
>>>> Yours,
>>>>
>>>> Age
>>>>
>>>>
>>>> On 20-06-11 11:56, patrickq wrote:
>>>> > You can definitely get just layout analysis before text recognition -
>>>> > look at the FindLinesCreateBlockList() API and the BLOCK_LIST data
>>>> > structure. You can then iterate through that structure to look at
>>>> > blocks and rows within these blocks. Keep in mind that a sentence in
>>>> > the image could be broken out into separate boxes altogether if you
>>>> > have anything more complex than a simple page, so you'll have to do
>>>> > the stitching yourself of rows in entirely different boxes, based on
>>>> > their coordinates. There are even cases where you might get
>>>> > "Patrick" returned as one row containing "Ptrik" and one row containing
>>>> > "ic" - rare but happens too, especially when the text line has a slope
>>>> > (even if very moderate).
>>>> > 
>>>> > Patrick
>>>> > 
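
The stitching Patrick describes could look roughly like the sketch below. It is not the tesseract API: it assumes the row boxes have already been pulled out as hypothetical `(left, top, width, height, text)` tuples, and it uses a simple vertical-overlap heuristic that will not cope well with sloped lines or with the rarer case of a single word split across rows.

```python
def stitch_rows(fragments, min_overlap=0.5):
    """Merge fragments (left, top, width, height, text) that share a text line.

    Two fragments are put on the same line when their vertical extents
    overlap by at least min_overlap of the shorter one's height.
    """
    lines = []  # each entry: [top, bottom, [fragments]]
    for frag in sorted(fragments, key=lambda f: (f[1], f[0])):
        left, top, width, height, text = frag
        bottom = top + height
        for line in lines:
            overlap = min(bottom, line[1]) - max(top, line[0])
            if overlap >= min_overlap * min(height, line[1] - line[0]):
                # Grow the line's vertical extent and attach the fragment.
                line[0] = min(line[0], top)
                line[1] = max(line[1], bottom)
                line[2].append(frag)
                break
        else:
            lines.append([top, bottom, [frag]])
    # Within a line, order fragments left to right and join their text.
    return [" ".join(f[4] for f in sorted(line[2])) for line in lines]


fragments = [
    (300, 22, 80, 20, "text"),
    (10, 20, 100, 20, "Some"),
    (10, 60, 120, 20, "Other"),
]
print(stitch_rows(fragments))
```

The `min_overlap` threshold is a guess. A lower value tolerates more slope at the cost of merging rows that merely sit close together.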
>>>> > On Jun 19, 4:07 pm, Prodoc <[email protected]> wrote:
>>>> >> Hi,
>>>> >>
>>>> >> In version 3 of tesseract-ocr there's a new page layout analysis
>>>> >> module. I'm interested to learn in what way it is used and how it can
>>>> >> be used.
>>>> >>
>>>> >> Does it provide additional user functionality or is it only used
>>>> >> internally? I.e. can I query it somehow to output all recognized text
>>>> >> areas (position and dimensions) without its actual text content?
>>>> >> Does it have any influence on the mark-up of the text output? E.g.
>>>> >> additional line breaks between text in case of a new paragraph.
>>>> >> I've played with the different pagesegmode values (0-3) but it gives
>>>> >> me the exact same output for each of them. Do these settings have
>>>> >> anything to do with the layout analysis?
>>>> >>
>>>> >> If recognizing text areas is what it does but you can't output just
>>>> >> the position and dimensions of them, it would be great to see this as
>>>> >> a new feature. In a program like gImageReader you have to do this
>>>> >> manually, OCRFeeder tries to do it automatically. If tesseract-ocr's
>>>> >> analysis is more accurate, one could use that as an input for
>>>> >> OCRFeeder again.
>>>> >>
>>>> >> Yours,
>>>> >>
>>>> >> Age Bosma
>>>> > 
>>>>
>>>>
>>>> -- 
>> You received this message because you are subscribed to the Google Groups
>> "tesseract-ocr" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to [email protected].
>> To post to this group, send email to [email protected].
>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>> To view this discussion on the web visit
>> https://groups.google.com/d/msgid/tesseract-ocr/8c929e9d-c33a-4978-a15a-1dd4f854b50b%40googlegroups.com.
>>
>> For more options, visit https://groups.google.com/d/optout.
>>
>
>
