On Sat, 30 Apr 2011, Enrico Segre wrote:

> Thx. I tried your patch in rev 581, and on my test page it worked as
> I expected only if I cropped the image very close to the left text
> margin. With a larger margin, a line break is inserted at every line.
> It may well have to do with the changed behavior you mention, and
> might be reflected in the (unpatched) hOCR output I've seen. I used a
> scanned book image as a test, so it may also be that image dirt in the
> left margin fools the layout detection, but I think that is less likely.

The code for detecting blocks seems to be broken again in rev 581.
That's probably why the hOCR output is wrong as well. If you
check out rev 549, the patch should work properly there (use
"svn update -r 549"). Or you can wait a bit; the block detection
code will probably be fixed again sometime soon.


> Also, I realize that for PG I would also like a blank line between
> unindented paragraphs if there is white space between them (thought
> breaks) - but that is not what I asked for in the first place.

Then you should probably wait for the hOCR output to work, or hack
the GetUTF8Text() method in baseapi.cpp yourself to use the
IsParagraphBreak() method. My simple patch definitely won't handle
that correctly.
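
To illustrate the general idea only, here is a rough standalone sketch of
emitting a blank line at paragraph breaks. It is not Tesseract code: the
Line struct, StartsParagraph(), JoinLines() and both thresholds are
hypothetical names and values made up for this example, and the real
IsParagraphBreak() may well use different criteria.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical description of one recognized text line; not a Tesseract type.
struct Line {
  std::string text;
  int left;    // x coordinate of the first glyph, in pixels
  int top;     // y coordinate of the top of the line
  int bottom;  // y coordinate of the bottom of the line
};

// Assumed heuristic: a new paragraph starts when the line is indented
// relative to the previous line, or when the vertical gap to the previous
// line is larger than gap_ratio times that line's height.
bool StartsParagraph(const Line& prev, const Line& cur,
                     int indent_px, double gap_ratio) {
  const int prev_height = prev.bottom - prev.top;
  const int gap = cur.top - prev.bottom;
  const bool indented = cur.left - prev.left >= indent_px;
  const bool spaced = gap > gap_ratio * prev_height;
  return indented || spaced;
}

// Joins lines with '\n' and adds one extra '\n' (a blank line) before each
// line that starts a new paragraph.
std::string JoinLines(const std::vector<Line>& lines,
                      int indent_px = 20, double gap_ratio = 1.0) {
  std::string out;
  for (std::size_t i = 0; i < lines.size(); ++i) {
    if (i > 0 &&
        StartsParagraph(lines[i - 1], lines[i], indent_px, gap_ratio)) {
      out += '\n';
    }
    out += lines[i].text;
    out += '\n';
  }
  return out;
}
```

The second test (vertical gap) is what would cover the unindented
paragraphs separated by white space; the indent test alone would miss
them. Both thresholds would need tuning against real scans.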

Rob
