Thank you for your reply. Good to learn that it is possible programmatically. I should, however, have been clearer that I was referring to command-line functionality.

Would it be an idea to extend the tesseract command-line tools so that they can output the dimensions of the containing blocks?

So one option to output just the text (current behaviour):
--------------------------------
Some text
And yet again some other text
--------------------------------

A second option to output the text marked with its block dimensions:
--------------------------------
[block:10,20,250,20]
Some text
[block:350,400,600,410]
And yet again some other text
--------------------------------

A third option to output just the blocks:
--------------------------------
[block:10,20,250,20]
[block:350,400,600,410]
--------------------------------

Yours,

Age

On 20-06-11 11:56, patrickq wrote:
> You can definitely get just layout analysis before text recognition -
> look at the FindLinesCreateBlockList() API and the BLOCK_LIST data
> structure. You can then iterate through that structure to look at
> blocks and rows within these blocks. Keep in mind that a sentence in
> the image could be broken out into separate boxes altogether if you
> have anything more complex than a simple page, so you'll have to do
> the stitching yourself of rows in entirely different boxes, based on
> their coordinates. There are even cases where you might get
> "Patrick" returned as one row containing "Ptrik" and one row containing
> "ic" - rare but happens too, especially when the text line has a slope
> (even if very moderate).
>
> Patrick
>
> On Jun 19, 4:07 pm, Prodoc <[email protected]> wrote:
>> Hi,
>>
>> In version 3 of tesseract-ocr there's a new page layout analysis
>> module. I'm interested to learn in what way it is used and how it can
>> be used.
>>
>> Does it provide additional user functionality or is it only used
>> internally? I.e. can I query it somehow to output all recognized text
>> areas (position and dimensions) without its actual text content?
>> Does it have any influence on the mark-up of the text output? E.g.
>> additional line breaks between text in case of a new paragraph.
>> I've played with the different pagesegmode values (0-3) but it gives
>> me the exact same output for each of them. Do these settings have
>> anything to do with the layout analysis?
>>
>> If recognizing text areas is what it does but you can't output just
>> the position and dimensions of them, it would be great to see this as
>> a new feature. In a program like gImageReader you have to do this
>> manually, OCRFeeder tries to do it automatically. If tesseract-ocr's
>> analysis is more accurate, one could use that as an input for
>> OCRFeeder again.
>>
>> Yours,
>>
>> Age Bosma
>
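
To illustrate why the marked-up output proposed at the top of this mail would be easy to consume from a script, here is a minimal Python sketch that splits such output back into blocks. It rests on assumptions: the "[block:...]" marker is only the format proposed in this mail, not something tesseract emits, and the names parse_blocks and Block are made up for the example. The four numbers are kept as plain integers, since the proposal does not say what each one means.

```python
import re
from typing import List, NamedTuple, Tuple

# Hypothetical marker, copied from the examples in this mail:
# "[block:a,b,c,d]" followed by the text belonging to that block.
BLOCK_RE = re.compile(r"\[block:(\d+),(\d+),(\d+),(\d+)\]")


class Block(NamedTuple):
    dims: Tuple[int, int, int, int]  # the four numbers, left uninterpreted
    text: str                        # text up to the next marker, may be empty


def parse_blocks(output: str) -> List[Block]:
    """Split marked-up output into (dimensions, text) pairs."""
    matches = list(BLOCK_RE.finditer(output))
    blocks = []
    for i, m in enumerate(matches):
        # A block's text runs from the end of its marker to the next marker.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(output)
        dims = tuple(int(g) for g in m.groups())
        blocks.append(Block(dims, output[m.end():end].strip()))
    return blocks


sample = (
    "[block:10,20,250,20]\n"
    "Some text\n"
    "[block:350,400,600,410]\n"
    "And yet again some other text\n"
)

for block in parse_blocks(sample):
    print(block.dims, repr(block.text))
```

The same function also handles the blocks-only variant, where every block simply comes back with empty text.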

