That looks right, thanks for that. I'll try to take a proper look soon and figure out how best to upstream stuff, and where it's worth doing so. In the meantime I'll attach the .diff (very small; only 200 lines), in case anyone else is interested, and so I don't forget ;)
Nick

On Wed, May 15, 2013 at 07:18:42AM -0700, Renard Wellnitz wrote:
> Hi Nick,
>
> here is the console output:
>
> localhost:tesseract-ocr-3.02 renard$ svn log -r COMMITTED
> ------------------------------------------------------------------------
> r705 | zde...@gmail.com | 2012-03-15 22:05:12 +0100 (Thu, 15 Mar 2012) | 1 line
>
> fixed build in java directory; create documentation package with 'make doc-pack'
> ------------------------------------------------------------------------
>
> Cheers
> Renard
>
> On Wednesday, 15 May 2013 14:28:35 UTC+2, Nick White wrote:
>
> I'm no expert with SVN, but I think this command will tell me what I
> want to know:
>
> svn log -r COMMITTED
>
> Thanks.
>
> On Wed, May 15, 2013 at 04:02:34AM -0700, Renard Wellnitz wrote:
> > Hi Nick,
> >
> > I'm not really proficient with svn. Maybe this helps? If you want me to
> > run a specific svn command I'll gladly do it.
> >
> > localhost:tesseract-ocr-3.02 renard$ svn ls "^/tags"
> > release-2.04/
> > release-3.00/
> > release-3.00.1/
> > release-3.01/
> > release-3.02.01/
> > release-3.02.02/
> > localhost:tesseract-ocr-3.02 renard$ svnversion .
> > 705M
> > localhost:tesseract-ocr-3.02 renard$
> >
> > I do not remember the exact changes. But my main goal was to get progress
> > information during the OCR process so that my app could show the bounding
> > box of the currently processed word.
> >
> > Cheers
> > Renard
> >
> > On Wednesday, 15 May 2013 11:37:26 UTC+2, Nick White wrote:
> >
> > Ah, I see it's pretty close to 3.02.01 (now only available as an SVN
> > tag). Am I correct in thinking that's the release you used? Or was
> > it an SVN revision near it?
> >
> > Thanks again,
> >
> > Nick
> >
> > On Wed, May 15, 2013 at 10:30:29AM +0100, Nick White wrote:
> > > Hi Renard,
> > >
> > > This is awesome, great job :)
> > >
> > > I was interested to see what changes you'd made to tesseract, so ran
> > > 'diff -r' on the tesseract-ocr-3.02 directory in github, but a quick
> > > look made it seem quite different to the
> > > tesseract-ocr-3.02.02.tar.gz currently available from Tesseract.
> > >
> > > Am I correct in thinking that? Is it based on a version from SVN? If
> > > so, which? If not, I'll just have to spend more time with diff ;-)
> > >
> > > I'd be keen to try and isolate and generalise any changes you made
> > > and get them back into the core code, if I can.
> > >
> > > Thanks for all this lovely free code!
> > >
> > > Nick
> > >
> > > On Tue, May 14, 2013 at 01:51:15PM -0700, Renard Wellnitz wrote:
> > > > Hi Tom,
> > > >
> > > > I decided to publish the code of the app under the Apache 2 licence.
> > > > However, the C++ code that deals with image processing uses the
> > > > stricter GPL v3, since that is where I put a lot of effort.
> > > >
> > > > The project still needs a readme and instructions on how to build the
> > > > binaries. For someone with a bit of Android/NDK experience it should
> > > > not be a big problem, however.
> > > > Readme and build instructions will follow in a couple of days.
> > > >
> > > > https://github.com/renard314/textfairy
> > > >
> > > > Cheers!
> > > > Renard
> >
> > --
> > --
> > You received this message because you are subscribed to the Google
> > Groups "tesseract-ocr" group.
> > To post to this group, send email to tesser...@googlegroups.com
> > To unsubscribe from this group, send email to
> > tesseract-oc...@googlegroups.com
> > For more options, visit this group at
> > http://groups.google.com/group/tesseract-ocr?hl=en
> >
> > ---
> > You received this message because you are subscribed to the Google Groups
> > "tesseract-ocr" group.
> > To unsubscribe from this group and stop receiving emails from it, send an
> > email to tesseract-oc...@googlegroups.com.
> > For more options, visit https://groups.google.com/groups/opt_out.

diff -r tesseract-ocr-r705/api/baseapi.cpp textfairy/tesseract-ocr-3.02/api/baseapi.cpp
34a35,37
> /* Version number of package */
> #define VERSION "3.02"
>
36a40
>
849c853
<     text = GetHOCRText(page_index);
---
>     text = GetHOCRText(NULL, page_index);
931a936,1044
>
>
> char* TessBaseAPI::GetHTMLText(const float minConfidenceToShowColor) {
>   if (page_res_ == NULL) {
>     return NULL;
>   }
>   int lcnt = 1, bcnt = 1, pcnt = 1, wcnt = 1;
>
>   STRING html_str("");
>   bool isItalic = false;
>   bool isBold = false;
>
>
>   ResultIterator *res_it = GetIterator();
>   for (; !res_it->Empty(RIL_BLOCK); wcnt++) {
>     if (res_it->Empty(RIL_WORD)) {
>       res_it->Next(RIL_WORD);
>       continue;
>     }
>
>     // Open any new block/paragraph/textline.
>     if (res_it->IsAtBeginningOf(RIL_BLOCK)) {
>       html_str +="<div>";
>     }
>     if (res_it->IsAtBeginningOf(RIL_PARA)){
>       html_str += "<p>";
>     }
>
>     // Now, process the word...
>     const char *font_name;
>     bool bold, italic, underlined, monospace, serif, smallcaps;
>     int pointsize, font_id;
>     font_name = res_it->WordFontAttributes(&bold, &italic, &underlined,
>                                            &monospace, &serif, &smallcaps,
>                                            &pointsize, &font_id);
>     bool last_word_in_line = res_it->IsAtFinalElement(RIL_TEXTLINE, RIL_WORD);
>     bool last_word_in_para = res_it->IsAtFinalElement(RIL_PARA, RIL_WORD);
>     bool last_word_in_block = res_it->IsAtFinalElement(RIL_BLOCK, RIL_WORD);
>
>     float confidence = res_it->Confidence(RIL_WORD);
>     bool addConfidence = false;
>     if ( confidence<minConfidenceToShowColor && res_it->GetUTF8Text(RIL_WORD)!=" "){
>       addConfidence = true;
>       html_str.add_str_int("<font conf='", (int)confidence);
>       html_str += "' color='#DE2222'>";
>     }
>
>     /*
>     if (!isBold && bold) {
>       html_str += "<em>";
>       isBold = true;
>     }
>     */
>
>     if (!isItalic && italic) {
>       html_str += "<strong>";
>       isItalic = true;
>     }
>     do {
>       const char *grapheme = res_it->GetUTF8Text(RIL_SYMBOL);
>       if (grapheme && grapheme[0] != 0) {
>         if (grapheme[1] == 0) {
>           switch (grapheme[0]) {
>             case '<': html_str += "&lt;"; break;
>             case '>': html_str += "&gt;"; break;
>             case '&': html_str += "&amp;"; break;
>             case '"': html_str += "&quot;"; break;
>             case '\'': html_str += "&#39;"; break;
>             default: html_str += grapheme; break;
>           }
>         } else {
>           html_str += grapheme;
>         }
>       }
>       delete []grapheme;
>       res_it->Next(RIL_SYMBOL);
>     } while (!res_it->Empty(RIL_BLOCK) && !res_it->IsAtBeginningOf(RIL_WORD));
>
>     if ((isItalic &&addConfidence==true) || (!italic && isItalic) || (isItalic && (last_word_in_block || last_word_in_para))){
>       html_str += "</strong>";
>       isItalic = false;
>     }
>     /*
>     if ((!bold && isBold) || (isBold && (last_word_in_block || last_word_in_para))){
>       html_str += "</em>";
>       isBold = false;
>     }
>     */
>     if (addConfidence==true){
>       html_str += "</font>";
>     }
>
>     html_str += " ";
>
>     if (last_word_in_para) {
>       html_str += "</p>\n";
>       pcnt++;
>     }
>     if (last_word_in_block) {
>       html_str += "</div>\n";
>       bcnt++;
>     }
>   }
>   char *ret = new char[html_str.length() + 1];
>   strcpy(ret, html_str.string());
>   delete res_it;
>   return ret;
> }
>
938,940c1051,1052
< char* TessBaseAPI::GetHOCRText(int page_number) {
<   if (tesseract_ == NULL ||
<       (page_res_ == NULL && Recognize(NULL) < 0))
---
> char* TessBaseAPI::GetHOCRText(struct ETEXT_DESC* monitor, int page_number) {
>   if (tesseract_ == NULL || (page_res_ == NULL && Recognize(monitor) < 0)) {
942c1054
<
---
>   }
944a1057
>   float row_height, descenders, ascenders;
948c1061
<   if (input_file_ == NULL)
---
>   if (input_file_ == NULL) {
949a1063
>   }
953c1067
<   hocr_str += input_file_ ? *input_file_ : "unknown";
---
>   hocr_str += input_file_ ? *input_file_ : "android";
982a1097,1101
>       res_it->RowAttributes(&row_height,&descenders, &ascenders);
>       hocr_str.add_str_int("' font='", 15);
>       hocr_str.add_str_int("' size='", row_height);
>       hocr_str.add_str_int("' descenders='", descenders * -1);
>       hocr_str.add_str_int("' ascenders='", ascenders);
1010c1129
<             default: hocr_str += grapheme;
---
>             default: hocr_str += grapheme; break;
diff -r tesseract-ocr-r705/api/baseapi.h textfairy/tesseract-ocr-3.02/api/baseapi.h
494c494,498
<   char* GetHOCRText(int page_number);
---
>   char* GetHOCRText(struct ETEXT_DESC* monitor, int page_number);
>
>   char* GetHTMLText(const float minConfidenceToShowColor);
>
>
diff -r tesseract-ocr-r705/ccmain/control.cpp textfairy/tesseract-ocr-3.02/ccmain/control.cpp
245c245,249
<       monitor->progress = 30 + 50 * word_index / stats_.word_count;
---
>       monitor->progress = 70 * word_index / stats_.word_count;
>       if (monitor->progress_callback!=NULL){
>         TBOX box = page_res_it.word()->word->bounding_box();
>         (*monitor->progress_callback)(monitor->progress,box.left(), box.right(), box.top(), box.bottom());
>       }
318c322,325
<       monitor->progress = 80 + 10 * word_index / stats_.word_count;
---
>       monitor->progress = 70 + 30 * word_index / stats_.word_count;
>       if (monitor->progress_callback!=NULL){
>         (*monitor->progress_callback)(monitor->progress,0,0,0,0);
>       }
diff -r tesseract-ocr-r705/ccmain/ltrresultiterator.cpp textfairy/tesseract-ocr-3.02/ccmain/ltrresultiterator.cpp
163a164,171
> void LTRResultIterator::RowAttributes( float* row_height,
>                                        float* descenders,
>                                        float* ascenders) const{
>   *row_height = it_->row()->row->x_height() + it_->row()->row->ascenders() - it_->row()->row->descenders();
>   *descenders = it_->row()->row->descenders();
>   *ascenders = it_->row()->row->ascenders();
> }
>
diff -r tesseract-ocr-r705/ccmain/ltrresultiterator.h textfairy/tesseract-ocr-3.02/ccmain/ltrresultiterator.h
112a113,114
>   void RowAttributes(float* row_height, float* descenders, float* ascenders) const;
>
diff -r tesseract-ocr-r705/ccutil/ocrclass.h textfairy/tesseract-ocr-3.02/ccutil/ocrclass.h
110a111
> typedef bool (*PROGRESS_FUNC)(int progress, int left, int right, int top, int bottom );
119a121
>   PROGRESS_FUNC progress_callback;/*called whenever progress increases*/
Binary files tesseract-ocr-r705/tessdata/chi_sim.traineddata and textfairy/tesseract-ocr-3.02/tessdata/chi_sim.traineddata differ
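
For anyone skimming the diff, here is a rough, standalone sketch of how I read the progress-callback change in control.cpp and ocrclass.h. It does not link against Tesseract. MonitorSketch, Box, FirstPass, SecondPass, RecordProgress and SimulateRecognition are all made-up names for illustration; only the PROGRESS_FUNC signature and the two progress formulas are taken from the patch. As far as I can tell the patch ignores the callback's bool return value, so the sketch does too.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Same signature as the PROGRESS_FUNC typedef the patch adds to ocrclass.h.
typedef bool (*PROGRESS_FUNC)(int progress, int left, int right, int top, int bottom);

// Hypothetical stand-in for ETEXT_DESC: only the two fields the patch touches.
struct MonitorSketch {
  int progress;
  PROGRESS_FUNC progress_callback;
};

// Hypothetical stand-in for TBOX.
struct Box { int left, right, top, bottom; };

// First loop in the patch: progress runs over 0..70 and each call
// carries the bounding box of the word being processed.
void FirstPass(MonitorSketch* monitor, const std::vector<Box>& words) {
  const int word_count = static_cast<int>(words.size());
  assert(word_count > 0);  // the patch divides by the word count
  for (int word_index = 0; word_index < word_count; ++word_index) {
    monitor->progress = 70 * word_index / word_count;
    if (monitor->progress_callback != NULL) {
      const Box& b = words[word_index];
      (*monitor->progress_callback)(monitor->progress, b.left, b.right, b.top, b.bottom);
    }
  }
}

// Second loop in the patch: progress runs over 70..100 and the box is all zeros.
void SecondPass(MonitorSketch* monitor, int word_count) {
  assert(word_count > 0);
  for (int word_index = 0; word_index < word_count; ++word_index) {
    monitor->progress = 70 + 30 * word_index / word_count;
    if (monitor->progress_callback != NULL) {
      (*monitor->progress_callback)(monitor->progress, 0, 0, 0, 0);
    }
  }
}

// Example callback: records each progress value it is handed.
static std::vector<int> g_seen;
bool RecordProgress(int progress, int, int, int, int) {
  g_seen.push_back(progress);
  return true;  // ignored by the caller, as in the patch
}

// Runs both passes over four fake words and returns the recorded values.
std::vector<int> SimulateRecognition() {
  g_seen.clear();
  std::vector<Box> words(4, Box{10, 50, 20, 5});
  MonitorSketch monitor = {0, &RecordProgress};
  FirstPass(&monitor, words);
  SecondPass(&monitor, static_cast<int>(words.size()));
  return g_seen;
}
```

Calling SimulateRecognition() from a main() is enough to try it. In the real patch the callback would presumably be set on the ETEXT_DESC that is passed to the new GetHOCRText(monitor, page_number).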