Hi Zdenko,

Man, would I have liked getting that hint 5 years ago... :-/

Best regards,

Age Bosma

On Tuesday, 8 March 2016 16:56:36 UTC+1, zdenop wrote:
>
> IMO it is - in hocr (xml) output or tsv (in master branch a.k.a. 3.05)
>
> Zdenko
>
> On Tue, Mar 8, 2016 at 3:14 PM, Age Bosma <[email protected]> wrote:
>
>> Hi Teng,
>>
>> The options I mention aren't available in tesseract. I listed them as
>> suggestions for extending tesseract. They haven't been implemented as far
>> as I know.
>>
>> Best regards,
>>
>> Age
>>
>> On Monday, 7 March 2016 09:56:40 UTC+1, Teng Long wrote:
>>>
>>> Hi Age, I'm a newbie in OCR.
>>> You mentioned 3 options to use tesseract,
>>> could you please tell me how to use these 3 options?
>>>
>>> any command is appreciated.
>>> Like:
>>> tesseract sample2.jpg ouput -l eng -psm 3
>>>
>>> Thank you !
>>>
>>> On Monday, June 20, 2011 at 8:19:03 PM UTC+8, Age Bosma wrote:
>>>>
>>>> Thank you for your reply.
>>>>
>>>> Nice to learn that it is possible programming-wise. I should, however,
>>>> have been more clear that I was referring to command-line functionality.
>>>>
>>>> Would it be an idea to extend the tesseract command-line tools to have
>>>> it output containing block dimensions?
>>>>
>>>> So one option to output just the text (current behaviour):
>>>> --------------------------------
>>>> Some text
>>>> And yet again some other text
>>>> --------------------------------
>>>>
>>>> A second option to output the text marked with its block dimensions:
>>>> --------------------------------
>>>> [block:10,20,250,20]
>>>> Some text
>>>> [block:350,400,600,410]
>>>> And yet again some other text
>>>> --------------------------------
>>>>
>>>> A third option to output just all blocks:
>>>> --------------------------------
>>>> [block:10,20,250,20]
>>>> [block:350,400,600,410]
>>>> --------------------------------
>>>>
>>>> Yours,
>>>>
>>>> Age
>>>>
>>>>
>>>> On 20-06-11 11:56, patrickq wrote:
>>>> > You can definitely get just layout analysis before text recognition -
>>>> > look at the FindLinesCreateBlockList() API and the BLOCK_LIST data
>>>> > structure. You can then iterate through that structure to look at
>>>> > blocks and rows within these blocks. Keep in mind that a sentence in
>>>> > the image could be broken out into separate boxes altogether if you
>>>> > have anything more complex than a simple page, so you'll have to do
>>>> > the stitching yourself of rows in entirely different boxes, based on
>>>> > their coordinates. There are even cases where you might get
>>>> > "Patrick" returned as one row containing "Ptrik" and one row containing
>>>> > "ic" - rare but happens too, especially when the text line has a slope
>>>> > (even if very moderate).
>>>> >
>>>> > Patrick
>>>> >
>>>> > On Jun 19, 4:07 pm, Prodoc <[email protected]> wrote:
>>>> >> Hi,
>>>> >>
>>>> >> In version 3 of tesseract-ocr there's a new page layout analysis
>>>> >> module. I'm interested to learn in what way it is used and how it can
>>>> >> be used.
>>>> >>
>>>> >> Does it provide additional user functionality or is it only used
>>>> >> internally? I.e. can I query it somehow to output all recognized text
>>>> >> areas (position and dimensions) without its actual text content?
>>>> >> Does it have any influence on the mark-up of the text output? I.e.
>>>> >> e.g. additional line breaks between text in case of a new paragraph.
>>>> >> I've played with the different pagesegmode values (0-3) but it gives
>>>> >> me the exact same output for each of them. Do these settings have
>>>> >> anything to do with the layout analysis?
>>>> >>
>>>> >> If recognizing text areas is what it does but you can't output just
>>>> >> the position and dimensions of them, it would be great to see this as
>>>> >> a new feature. In a program like gImageReader you have to do this
>>>> >> manually, OCRFeeder tries to do it automatically. If tesseract-ocr's
>>>> >> analysis is more accurate, one could use that as an input for
>>>> >> OCRFeeder again.
>>>> >>
>>>> >> Yours,
>>>> >>
>>>> >> Age Bosma
>>>> >
>>>>
>> --
>> You received this message because you are subscribed to the Google Groups
>> "tesseract-ocr" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to [email protected].
>> To post to this group, send email to [email protected].
>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>> To view this discussion on the web visit
>> https://groups.google.com/d/msgid/tesseract-ocr/8c929e9d-c33a-4978-a15a-1dd4f854b50b%40googlegroups.com.
>> For more options, visit https://groups.google.com/d/optout.
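
For anyone finding this thread later, here is a rough sketch of how the TSV output Zdenko mentions could be turned into the "[block:...]" listing proposed in 2011. It is only an illustration, not something tesseract does itself. Assumptions: the TSV was produced with something like `tesseract page.png out tsv` (3.05 or later) and has the usual columns (level, page_num, block_num, ..., left, top, width, height, conf, text), where level 2 rows are blocks and level 5 rows are words; the four numbers in "[block:...]" are read as left, top, width, height, which the original proposal never specified; the sample rows are made up; and `read_blocks` / `format_blocks` are hypothetical helper names.

```python
import csv
import io

COLUMNS = ["level", "page_num", "block_num", "par_num", "line_num",
           "word_num", "left", "top", "width", "height", "conf", "text"]

# Made-up rows for illustration only (paragraph and line rows omitted).
# Real data would come from the .tsv file written by tesseract.
ROWS = [
    (1, 1, 0, 0, 0, 0, 0, 0, 1000, 900, -1, ""),        # page
    (2, 1, 1, 0, 0, 0, 10, 20, 250, 20, -1, ""),        # block 1
    (5, 1, 1, 1, 1, 1, 10, 20, 120, 20, 96, "Some"),
    (5, 1, 1, 1, 1, 2, 140, 20, 120, 20, 95, "text"),
    (2, 1, 2, 0, 0, 0, 350, 400, 600, 410, -1, ""),     # block 2
    (5, 1, 2, 1, 1, 1, 350, 400, 60, 20, 93, "And"),
    (5, 1, 2, 1, 1, 2, 420, 400, 60, 20, 92, "yet"),
    (5, 1, 2, 1, 1, 3, 490, 400, 90, 20, 94, "again"),
    (5, 1, 2, 1, 1, 4, 590, 400, 80, 20, 91, "some"),
    (5, 1, 2, 1, 1, 5, 680, 400, 90, 20, 90, "other"),
    (5, 1, 2, 1, 1, 6, 780, 400, 70, 20, 95, "text"),
]
SAMPLE_TSV = "\n".join(
    ["\t".join(COLUMNS)] + ["\t".join(map(str, r)) for r in ROWS]
) + "\n"


def read_blocks(tsv_text):
    """Collect each block's bounding box and its words, grouped by line."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t",
                            quoting=csv.QUOTE_NONE)
    blocks = {}  # block_num -> {"bbox": (left, top, width, height), "lines": {...}}
    for row in reader:
        level = int(row["level"])
        num = int(row["block_num"])
        if level == 2:  # block row: remember its position and dimensions
            bbox = tuple(int(row[k]) for k in ("left", "top", "width", "height"))
            blocks[num] = {"bbox": bbox, "lines": {}}
        elif level == 5 and num in blocks:  # word row: attach to its line
            word = (row["text"] or "").strip()
            if word:
                key = (int(row["par_num"]), int(row["line_num"]))
                blocks[num]["lines"].setdefault(key, []).append(word)
    return blocks


def format_blocks(blocks, with_text=True):
    """Render blocks as "[block:left,top,width,height]" markers, optionally with text."""
    out = []
    for block in blocks.values():
        out.append("[block:%d,%d,%d,%d]" % block["bbox"])
        if with_text:
            for words in block["lines"].values():
                out.append(" ".join(words))
    return "\n".join(out)


if __name__ == "__main__":
    blocks = read_blocks(SAMPLE_TSV)
    print(format_blocks(blocks))                   # text marked with block dimensions
    print(format_blocks(blocks, with_text=False))  # just the blocks
```

To use it on real output, read the .tsv file into a string and pass that to `read_blocks` instead of `SAMPLE_TSV`. Patrick's caveat still applies: a sentence may be split across blocks, so any stitching has to be done on the coordinates afterwards.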

