Thanks. I tried your patch in rev 581, and on my test page it worked as I expected only if I cropped the image very close to the left text margin. With a larger margin, a line break is inserted at every line. It may well have to do with the changed behavior you mention, and might be reflected in the (unpatched) hocr output I've seen. I used a scanned book image as a test, so it may also be that image dirt in the left margin fools the layout detection, but I think that is less likely.
Also, I realize that for PG I would like a blank line between unindented paragraphs as well, if there is white space between them (thought breaks), but that is not what I asked for in the first place.

Enrico

> I ended up hacking TessBaseAPI::GetUTF8Text() in api/baseapi.cpp
> to add a linefeed before indented text. It is a very simple hack,
> and will probably fail with poetry and other ragged-left layouts,
> but it gets most of the simple prose paragraphs right. It also
> has the problem of not working when applied to revisions of tesseract
> where the block detection code has changed behaviour (like it has under
> the current revision 581). I know that it works under revision 549,
> so if you check that out and apply the attached patch, you should
> get a blank line appearing before each indented line.
>
> Cheers,
> Rob Komar
>
> baseapi.cpp.diff
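
For anyone following the thread, here is a rough sketch of the idea being discussed: put a blank line before indented lines, and also before lines that follow extra vertical white space (thought breaks). This is not Rob's patch and does not use any Tesseract types. `TextLine`, `JoinWithParagraphBreaks`, `indent_threshold` and `gap_factor` are hypothetical names made up for illustration, and the left margin is simply assumed to be the smallest left edge among the lines.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for one recognized text line (not a Tesseract type).
struct TextLine {
  std::string text;
  int left;    // x coordinate of the left edge, in pixels
  int top;     // y coordinate of the top edge, in pixels
  int bottom;  // y coordinate of the bottom edge, in pixels
};

// Joins lines with '\n' and inserts an extra blank line before a line that
// (a) starts more than indent_threshold pixels right of the left margin, or
// (b) is separated from the previous line by a vertical gap larger than
//     gap_factor times its own height.
// Assumption: the left margin is the smallest left edge of all lines, so a
// speck of dirt recognized as a "line" in the margin would defeat this.
std::string JoinWithParagraphBreaks(const std::vector<TextLine>& lines,
                                    int indent_threshold,
                                    double gap_factor) {
  if (lines.empty()) return "";

  int margin = lines[0].left;
  for (const TextLine& line : lines) margin = std::min(margin, line.left);

  std::string out;
  for (std::size_t i = 0; i < lines.size(); ++i) {
    if (i > 0) {
      const bool indented = lines[i].left - margin > indent_threshold;
      const int height = lines[i].bottom - lines[i].top;
      const int gap = lines[i].top - lines[i - 1].bottom;
      const bool thought_break = gap > gap_factor * height;
      if (indented || thought_break) out += "\n";  // blank line
    }
    out += lines[i].text;
    out += "\n";
  }
  return out;
}
```

As Rob says about his own hack, a rule like this would probably fail on poetry and other ragged-left layouts, and both thresholds would need tuning per scan.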

