Thanks. I tried your patch in rev 581, and on my test page it worked as
I expected only if I cropped the image very close to the left text
margin. With a larger margin, a line break is inserted at every line.
It may well have to do with the changed behavior you mention, and
might be reflected in the (unpatched) hOCR output I've seen. I used a
scanned book image as a test, so it may also be that image dirt in the
left margin fools the layout detection, but I think that is less likely.

Also, I realize that for PG I would like a blank line between
unindented paragraphs as well, if there is white space between them
(thought breaks) - but that is not what I asked for in the first place.
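
To make the two cases concrete (a blank line before an indented line, and a
blank line where extra vertical space separates unindented paragraphs), here
is a minimal standalone sketch. It is not Rob's patch and does not use any
Tesseract API: the TextLine struct, the JoinWithParagraphBreaks function and
both thresholds are hypothetical names made up for illustration, and the
line boxes are assumed to be given in pixels, top to bottom.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical line record for illustration; not a Tesseract type.
struct TextLine {
  std::string text;
  int left;    // x coordinate of the line's left edge, in pixels
  int top;     // y coordinate of the line's top edge, in pixels
  int bottom;  // y coordinate of the line's bottom edge, in pixels
};

// Joins lines with '\n' and adds an extra blank line before a line that is
// either indented relative to the block's left margin, or separated from
// the previous line by an unusually large vertical gap (a "thought break").
std::string JoinWithParagraphBreaks(const std::vector<TextLine>& lines,
                                    int indent_tolerance,
                                    double gap_factor) {
  if (lines.empty()) return "";

  // Assumption: the left margin is the smallest left edge in the block.
  int margin = lines[0].left;
  for (const TextLine& line : lines) margin = std::min(margin, line.left);

  std::string out;
  for (std::size_t i = 0; i < lines.size(); ++i) {
    if (i > 0) {
      const TextLine& prev = lines[i - 1];
      const TextLine& cur = lines[i];
      const bool indented = cur.left - margin > indent_tolerance;
      const int prev_height = prev.bottom - prev.top;
      const int gap = cur.top - prev.bottom;
      const bool thought_break = gap > gap_factor * prev_height;
      if (indented || thought_break) out += "\n";
    }
    out += lines[i].text;
    out += "\n";
  }
  return out;
}
```

Taking the margin as the minimum left edge is the fragile part of this
sketch: if the margin estimate ends up too far left for any reason, every
real text line looks indented and gets a blank line in front of it, which
would resemble what I see with the wider crop.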

Enrico

> I ended up hacking TessBaseAPI::GetUTF8Text() in api/baseapi.cpp
> to add a linefeed before indented text.  It is a very simple hack,
> and will probably fail with poetry and other ragged-left layouts,
> but it gets most of the simple prose paragraphs right.  It also
> has the problem of not working when applied to revisions of tesseract
> where the block detection code has changed behaviour (like it has under
> the current revision 581).  I know that it works under revision 549,
> so if you check that out and apply the attached patch, you should
> get a blank line appearing before each indented line.
>
> Cheers,
> Rob Komar
>
> Attachment: baseapi.cpp.diff (1K)
