On Mon, 25 Apr 2011, Enrico Segre wrote:
> I'm trying to use tesseract to provide content for Project
> Gutenberg. The proofing workflow there requires one blank line
> between recognized paragraphs, a paragraph being identified by
> the different indentation of its first line w.r.t. the body text.
> I found this old post:
> http://groups.google.com/group/tesseract-ocr/browse_thread/thread/34ab77d8dd1636e3/35e59c6a67661ee3?lnk=gst&q=paragraph#35e59c6a67661ee3
> Am I understanding correctly that the situation hasn't changed since
> then, or is there a way?
> Enrico
I ended up hacking TessBaseAPI::GetUTF8Text() in api/baseapi.cpp
to add a linefeed before indented text. It is a very simple hack,
and will probably fail with poetry and other ragged-left layouts,
but it gets most simple prose paragraphs right. It also stops
working on revisions of tesseract where the block detection code
behaves differently (as it does in the current revision 581). I know
that it works under revision 549, so if you check that out and apply
the attached patch, you should get a blank line before each indented
line.
Cheers,
Rob Komar
--- api/baseapi.cpp.orig 2010-12-12 16:42:14.279404997 -0800
+++ api/baseapi.cpp 2010-12-12 16:51:46.886404997 -0800
@@ -791,7 +791,7 @@
if (tesseract_ == NULL ||
(!recognition_done_ && Recognize(NULL) < 0))
return NULL;
- int total_length = TextLength(NULL);
+ int total_length = TextLength(NULL)+64; //+64 Add space for paragraph breaks
PAGE_RES_IT page_res_it(page_res_);
char* result = new char[total_length];
char* ptr = result;
@@ -800,6 +800,15 @@
WERD_RES *word = page_res_it.word();
WERD_CHOICE* choice = word->best_choice;
if (choice != NULL) {
+ if (word->word->flag(W_BOL)) {
+ //If the first word is indented by more than half the row height
+ //from left side of the current block, add a paragraph break
+ int minParIndent = page_res_it.row()->row->x_height()/2;
+ int word_xstart = word->word->bounding_box().left();
+ int block_xstart = page_res_it.block()->block->bounding_box().left();
+ if ((word_xstart-block_xstart)>minParIndent)
+ *ptr++ = '\n';
+ }
strcpy(ptr, choice->unichar_string().string());
ptr += choice->unichar_string().length();
if (word->word->flag(W_EOL))
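
For readers who want to try the heuristic without rebuilding tesseract, here is a minimal standalone sketch of the same rule: a line starts a new paragraph when its first word is indented from the block's left edge by more than half the row's x-height. The `Line` struct and `JoinWithParagraphBreaks` are hypothetical names made up for illustration; they are not part of the tesseract API, and the coordinates are assumed to be pixel values like those returned by the `bounding_box()` calls in the patch. The sketch builds a `std::string` rather than writing into a preallocated buffer, so it does not need the patch's fixed 64-byte budget for the extra newlines.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for one recognized text line (not a tesseract type).
struct Line {
  std::string text;  // recognized text of the line, without trailing newline
  int left;          // left edge of the line's first word, in pixels
  int x_height;      // x-height of the row, in pixels
};

// Joins the lines with '\n', emitting one extra '\n' before every line whose
// first word is indented from block_left by more than half the x-height.
// As in the patch, an indented first line also gets a leading blank line.
std::string JoinWithParagraphBreaks(const std::vector<Line>& lines,
                                    int block_left) {
  std::string out;
  for (const Line& line : lines) {
    const int min_par_indent = line.x_height / 2;
    if (line.left - block_left > min_par_indent) {
      out += '\n';  // paragraph break: blank line before the indented line
    }
    out += line.text;
    out += '\n';
  }
  return out;
}
```

Like the patch, this will probably misfire on poetry and other ragged-left layouts, since any sufficiently indented line is treated as a paragraph start.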