Re: Page layout analysis module

Age Bosma Mon, 20 Jun 2011 06:49:58 -0700

Thank you for your reply.

Nice to learn that it is possible programmatically. I should, however,
have made it clearer that I was referring to command-line functionality.


Would it be an idea to extend the tesseract command-line tool so that it
can output the dimensions of each containing block?

So one option to output just the text (current behaviour):
--------------------------------
Some text
And yet again some other text
--------------------------------

A second option to output the text marked with its block dimensions:
--------------------------------
[block:10,20,250,20]
Some text
[block:350,400,600,410]
And yet again some other text
--------------------------------

A third option to output just the blocks:
--------------------------------
[block:10,20,250,20]
[block:350,400,600,410]
--------------------------------
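
To illustrate how another program might consume the second format, here
is a minimal Python sketch. Everything in it is an assumption: the
[block:...] markers are only the proposal above, not something tesseract
emits, and parse_blocks is a hypothetical helper name. The four numbers
are kept as an uninterpreted tuple, since their meaning has not been
settled.

```python
import re
from typing import List, Tuple

# Matches a proposed marker line such as "[block:10,20,250,20]".
# The marker syntax is hypothetical; it is not existing tesseract output.
BLOCK_RE = re.compile(r"^\[block:(\d+),(\d+),(\d+),(\d+)\]$")

Block = Tuple[Tuple[int, ...], List[str]]


def parse_blocks(output: str) -> List[Block]:
    """Group text lines under the block marker that precedes them."""
    blocks: List[Block] = []
    for raw in output.splitlines():
        line = raw.strip()
        match = BLOCK_RE.match(line)
        if match:
            # Start a new block; the four numbers are stored as-is.
            blocks.append((tuple(int(g) for g in match.groups()), []))
        elif line and blocks:
            # Non-empty text belongs to the most recent block.
            blocks[-1][1].append(line)
    return blocks


if __name__ == "__main__":
    sample = (
        "[block:10,20,250,20]\n"
        "Some text\n"
        "[block:350,400,600,410]\n"
        "And yet again some other text\n"
    )
    for dims, lines in parse_blocks(sample):
        print(dims, lines)
```

The same function also handles the third format: each block simply ends
up with an empty list of text lines.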

Yours,

Age


On 20-06-11 11:56, patrickq wrote:
> You can definitely get just layout analysis before text recognition -
> look at the FindLinesCreateBlockList() API and the BLOCK_LIST data
> structure. You can then iterate through that structure to look at
> blocks and rows within these blocks. Keep in mind that a sentence in
> the image could be broken out into separate boxes altogether if you
> have anything more complex than a simple page, so you'll have to do
> the stitching yourself of rows in entirely different boxes, based on
> their coordinates. There are even cases where you might get
> "Patrick" returned as one row containing "Ptrik" and one row containing
> "ic" - rare but happens too, especially when the text line has a slope
> (even if very moderate).
> 
> Patrick
> 
> On Jun 19, 4:07 pm, Prodoc <[email protected]> wrote:
>> Hi,
>>
>> In version 3 of tesseract-ocr there's a new page layout analysis
>> module. I'm interested to learn in what way it is used and how it can
>> be used.
>>
>> Does it provide additional user functionality or is it only used
>> internally? I.e. can I query it somehow to output all recognized text
>> areas (position and dimensions) without its actual text content?
>> Does it have any influence on the mark-up of the text output, e.g.
>> additional line breaks between text in case of a new paragraph?
>> I've played with the different pagesegmode values (0-3) but it gives
>> me the exact same output for each of them. Do these settings have
>> anything to do with the layout analysis?
>>
>> If recognizing text areas is what it does but you can't output just
>> the position and dimensions of them, it would be great to see this as
>> a new feature. In a program like gImageReader you have to do this
>> manually, OCRFeeder tries to do it automatically. If tesseract-ocr's
>> analysis is more accurate, one could use that as an input for
>> OCRFeeder again.
>>
>> Yours,
>>
>> Age Bosma
>
