IMO it is: in hOCR (XML) output, or in TSV output (in the master branch, a.k.a. 3.05).

Zdenko
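For anyone who wants the "[block:...]" listing proposed further down this thread, here is a minimal sketch of how it could be derived from the TSV output. It assumes the TSV has a header row with the columns level, left, top, width and height, and that rows with level 2 describe blocks. The helper names block_boxes and format_blocks are hypothetical, the sample rows are made up for illustration, and the order left,top,width,height inside the brackets is my assumption about what the four numbers mean.

```python
import csv
import io


def block_boxes(tsv_text):
    """Return (left, top, width, height) for each block-level row.

    Assumes a header row naming the columns and that level == 2
    marks a block (1 = page, 3 = paragraph, 4 = line, 5 = word).
    """
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    boxes = []
    for row in reader:
        if row["level"] == "2":
            boxes.append(
                tuple(int(row[k]) for k in ("left", "top", "width", "height"))
            )
    return boxes


def format_blocks(boxes):
    """Render boxes in the [block:l,t,w,h] style suggested in this thread."""
    return "\n".join("[block:%d,%d,%d,%d]" % box for box in boxes)


# Made-up sample rows, not real tesseract output.
HEADER = ["level", "page_num", "block_num", "par_num", "line_num",
          "word_num", "left", "top", "width", "height", "conf", "text"]
ROWS = [
    ["1", "1", "0", "0", "0", "0", "0", "0", "1000", "900", "-1", ""],
    ["2", "1", "1", "0", "0", "0", "10", "20", "250", "20", "-1", ""],
    ["5", "1", "1", "1", "1", "1", "10", "20", "60", "20", "95", "Some"],
    ["2", "1", "2", "0", "0", "0", "350", "400", "600", "410", "-1", ""],
]
SAMPLE_TSV = "\n".join("\t".join(fields) for fields in [HEADER] + ROWS)

if __name__ == "__main__":
    print(format_blocks(block_boxes(SAMPLE_TSV)))
```

With a build that includes TSV support, something like "tesseract sample2.jpg output -l eng tsv" should write output.tsv, and the contents of that file can be passed to block_boxes() in place of SAMPLE_TSV.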
On Tue, Mar 8, 2016 at 3:14 PM, Age Bosma <agebo...@gmail.com> wrote:
> Hi Teng,
>
> The options I mentioned aren't available in tesseract. I listed them as
> suggestions for extending tesseract. They haven't been implemented as far
> as I know.
>
> Best regards,
>
> Age
>
> On Monday, 7 March 2016 09:56:40 UTC+1, Teng Long wrote:
>>
>> Hi Age, I'm a newbie in OCR.
>> You mentioned 3 options to use tesseract,
>> could you please tell me how to use these 3 options?
>>
>> Any command is appreciated.
>> Like:
>> tesseract sample2.jpg ouput -l eng -psm 3
>>
>> Thank you!
>>
>> On Monday, June 20, 2011 at 8:19:03 PM UTC+8, Age Bosma wrote:
>>>
>>> Thank you for your reply.
>>>
>>> Nice to learn that it is possible programming-wise. I should, however,
>>> have been more clear that I was referring to command-line functionality.
>>>
>>> Would it be an idea to extend the tesseract command-line tools so that
>>> the output contains block dimensions?
>>>
>>> So one option to output just the text (current behaviour):
>>> --------------------------------
>>> Some text
>>> And yet again some other text
>>> --------------------------------
>>>
>>> A second option to output the text marked with its block dimensions:
>>> --------------------------------
>>> [block:10,20,250,20]
>>> Some text
>>> [block:350,400,600,410]
>>> And yet again some other text
>>> --------------------------------
>>>
>>> A third option to output just all blocks:
>>> --------------------------------
>>> [block:10,20,250,20]
>>> [block:350,400,600,410]
>>> --------------------------------
>>>
>>> Yours,
>>>
>>> Age
>>>
>>>
>>> On 20-06-11 11:56, patrickq wrote:
>>> > You can definitely get just layout analysis before text recognition -
>>> > look at the FindLinesCreateBlockList() API and the BLOCK_LIST data
>>> > structure. You can then iterate through that structure to look at
>>> > blocks and rows within these blocks.
>>> > Keep in mind that a sentence in
>>> > the image could be broken out into separate boxes altogether if you
>>> > have anything more complex than a simple page, so you'll have to do
>>> > the stitching of rows in entirely different boxes yourself, based on
>>> > their coordinates. There are even cases where you might get
>>> > "Patrick" returned as one row containing "Ptrik" and one row containing
>>> > "ic" - rare but happens too, especially when the text line has a slope
>>> > (even if very moderate).
>>> >
>>> > Patrick
>>> >
>>> > On Jun 19, 4:07 pm, Prodoc <agebo...@gmail.com> wrote:
>>> >> Hi,
>>> >>
>>> >> In version 3 of tesseract-ocr there's a new page layout analysis
>>> >> module. I'm interested to learn in what way it is used and how it can
>>> >> be used.
>>> >>
>>> >> Does it provide additional user functionality or is it only used
>>> >> internally? I.e. can I query it somehow to output all recognized text
>>> >> areas (position and dimensions) without their actual text content?
>>> >> Does it have any influence on the mark-up of the text output? For
>>> >> example, additional line breaks between text in case of a new paragraph.
>>> >> I've played with the different pagesegmode values (0-3) but it gives
>>> >> me the exact same output for each of them. Do these settings have
>>> >> anything to do with the layout analysis?
>>> >>
>>> >> If recognizing text areas is what it does but you can't output just
>>> >> the position and dimensions of them, it would be great to see this as
>>> >> a new feature. In a program like gImageReader you have to do this
>>> >> manually, OCRFeeder tries to do it automatically. If tesseract-ocr's
>>> >> analysis is more accurate, one could use that as an input for
>>> >> OCRFeeder again.
>>> >>
>>> >> Yours,
>>> >>
>>> >> Age Bosma
>>> >
>>>
>>>
>>> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to tesseract-ocr+unsubscr...@googlegroups.com.
> To post to this group, send email to tesseract-ocr@googlegroups.com.
> Visit this group at https://groups.google.com/group/tesseract-ocr.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/tesseract-ocr/8c929e9d-c33a-4978-a15a-1dd4f854b50b%40googlegroups.com.
>
> For more options, visit https://groups.google.com/d/optout.