On Sat, 30 Apr 2011, Enrico Segre wrote:
> Thx. I tried your patch in rev 581, and on my test page it worked like I expected only if I cropped the image very close to the left text margin. With a larger margin, a line break is inserted at every line. It may well have to do with the changed behavior you mention, and might be reflected in the (unpatched) hocr output I've seen. I used a scanned book image as test, so it may also be that image dirt in the left margin fools the layout detection, but I think it is less likely.
The code for detecting blocks seems to be broken again in rev 581. That's probably why the hocr output is wrong, as well. If you check out rev 549, the patch should work properly there (use "svn update -r 549"). Or you can wait a bit and the block detection code will probably be fixed again sometime soon.
> Also, I realize that for PG I would like a blank line as well between unindented paragraphs, if there is white space between them (thought breaks) - but that is not what I asked in the first place.
Then you should probably wait for the hocr output to work, or hack the GetUTF8Text() method in baseapi.cpp yourself to use the IsParagraphBreak() method. My simple patch definitely won't handle that correctly.

Rob
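To make the "thought break" idea above concrete, here is a rough, standalone sketch of the kind of rule such a hack might apply: emit a blank line before a text line that is either indented or separated from the previous line by extra vertical white space. This is not Tesseract code. The `TextLine` struct, the `StartsParagraph` and `JoinWithParagraphBreaks` helpers, and the pixel thresholds are all hypothetical, and the sketch does not call or imitate IsParagraphBreak(), whose actual behavior may differ.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical line record for illustration; NOT a Tesseract type.
struct TextLine {
  std::string text;
  int left;    // x of the first glyph, in pixels
  int top;     // y of the top of the line, in pixels
  int bottom;  // y of the bottom of the line, in pixels
};

// Hypothetical rule: `cur` starts a new paragraph if it is indented relative
// to the block's left edge, or if the vertical gap to `prev` is larger than
// gap_factor times the height of `prev` (a "thought break").
bool StartsParagraph(const TextLine& prev, const TextLine& cur,
                     int block_left, int indent_px, double gap_factor) {
  const int line_height = prev.bottom - prev.top;
  const int gap = cur.top - prev.bottom;
  const bool indented = (cur.left - block_left) >= indent_px;
  const bool thought_break = gap > gap_factor * line_height;
  return indented || thought_break;
}

// Joins lines with '\n' and inserts an extra blank line at each detected
// paragraph start. The default thresholds are arbitrary assumptions.
std::string JoinWithParagraphBreaks(const std::vector<TextLine>& lines,
                                    int indent_px = 20,
                                    double gap_factor = 1.0) {
  if (lines.empty()) return "";
  // Use the leftmost line start as the block's left margin.
  int block_left = lines[0].left;
  for (const TextLine& line : lines) {
    block_left = std::min(block_left, line.left);
  }
  std::string out;
  for (std::size_t i = 0; i < lines.size(); ++i) {
    if (i > 0 && StartsParagraph(lines[i - 1], lines[i], block_left,
                                 indent_px, gap_factor)) {
      out += "\n";  // blank line between paragraphs
    }
    out += lines[i].text;
    out += "\n";
  }
  return out;
}
```

Taking the left margin from the text lines themselves, rather than from the image edge, means a wide scan margin would not by itself make every line look indented. Whether that has anything to do with the rev 581 behavior described above is only a guess.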

