That looks right, thanks for that.

I'll try to take a proper look soon and figure out how best to
upstream stuff, and where it's worth doing so. In the meantime I'll
attach the .diff (very small; only 200 lines), in case anyone else
is interested, and so I don't forget ;)

Nick

On Wed, May 15, 2013 at 07:18:42AM -0700, Renard Wellnitz wrote:
> Hi Nick,
> 
> here is the console output:
> 
> 
>     localhost:tesseract-ocr-3.02 renard$  svn log -r COMMITTED 
>     ------------------------------------------------------------------------
>     r705 | zde...@gmail.com | 2012-03-15 22:05:12 +0100 (Thu, 15 Mar 2012) | 1 line
> 
>     fixed build in java directory; create documentation package with 'make doc-pack'
>     ------------------------------------------------------------------------
> 
> 
> Cheers
> Renard 
> 
> 
> On Wednesday, 15 May 2013 14:28:35 UTC+2, Nick White wrote:
> 
>     I'm no expert with SVN, but I think this command will tell me what I
>     want to know:
> 
>       svn log -r COMMITTED
> 
>     Thanks.
> 
>     On Wed, May 15, 2013 at 04:02:34AM -0700, Renard Wellnitz wrote:
>     > Hi Nick,
>     >
>     > I'm not really proficient with svn. Maybe this helps? If you want me to
>     > run a specific svn command I'll gladly do it.
>     >
>     >
>     >     localhost:tesseract-ocr-3.02 renard$ svn ls "^/tags"
>     >     release-2.04/
>     >     release-3.00/
>     >     release-3.00.1/
>     >     release-3.01/
>     >     release-3.02.01/
>     >     release-3.02.02/
>     >     localhost:tesseract-ocr-3.02 renard$ svnversion .
>     >     705M
>     >     localhost:tesseract-ocr-3.02 renard$
>     >
>     >
>     > I do not remember the exact changes, but my main goal was to get progress
>     > information during the OCR process so that my app could show the bounding
>     > boxes of the currently processed word.
>     >
>     > Cheers
>     > Renard
>     >
>     >
>     > On Wednesday, 15 May 2013 11:37:26 UTC+2, Nick White wrote:
>     >
>     >     Ah, I see it's pretty close to 3.02.01 (now only available as an SVN
>     >     tag). Am I correct in thinking that's the release you used? Or was
>     >     it an SVN revision near it?
>     >
>     >     Thanks again,
>     >
>     >     Nick
>     >
>     >     On Wed, May 15, 2013 at 10:30:29AM +0100, Nick White wrote:
>     >     > Hi Renard,
>     >     >
>     >     > This is awesome, great job :)
>     >     >
>     >     > I was interested to see what changes you'd made to tesseract, so ran
>     >     > 'diff -r' on the tesseract-ocr-3.02 directory in github, but a quick
>     >     > look made it seem quite different to the
>     >     > tesseract-ocr-3.02.02.tar.gz currently available from Tesseract.
>     >     >
>     >     > Am I correct in thinking that? Is it based on a version from SVN? If
>     >     > so, which? If not, I'll just have to spend more time with diff ;-)
>     >     >
>     >     > I'd be keen to try and isolate and generalise any changes you made
>     >     > and get them back into the core code, if I can.
>     >     >
>     >     > Thanks for all this lovely free code!
>     >     >
>     >     > Nick
>     >     >
>     >     > On Tue, May 14, 2013 at 01:51:15PM -0700, Renard Wellnitz wrote:
>     >     > > Hi Tom,
>     >     > >
>     >     > > I decided to publish the code of the app under the Apache 2 licence.
>     >     > > However, the C++ code that deals with image processing uses the
>     >     > > stricter GPL v3, since that is where I put in a lot of effort.
>     >     > >
>     >     > > The project still needs a readme and instructions on how to build the
>     >     > > binaries. For someone with a bit of Android/NDK experience it should
>     >     > > not be a big problem, however.
>     >     > > Readme and build instructions will follow in a couple of days.
>     >     > >
>     >     > > https://github.com/renard314/textfairy
>     >     > >
>     >     > > Cheers!
>     >     > > Renard
>     >
>     > --
>     > --
>     > You received this message because you are subscribed to the Google
>     > Groups "tesseract-ocr" group.
>     > To post to this group, send email to tesser...@googlegroups.com
>     > To unsubscribe from this group, send email to
>     > tesseract-oc...@googlegroups.com
>     > For more options, visit this group at
>     > http://groups.google.com/group/tesseract-ocr?hl=en
>     >  
>     > ---
>     > You received this message because you are subscribed to the Google Groups
>     > "tesseract-ocr" group.
>     > To unsubscribe from this group and stop receiving emails from it, send an
>     > email to tesseract-oc...@googlegroups.com.
>     > For more options, visit https://groups.google.com/groups/opt_out.
>     >  
>     >  
> 


diff -r tesseract-ocr-r705/api/baseapi.cpp textfairy/tesseract-ocr-3.02/api/baseapi.cpp
34a35,37
> /* Version number of package */
> #define VERSION "3.02"
> 
36a40
> 
849c853
<       text = GetHOCRText(page_index);
---
>       text = GetHOCRText(NULL, page_index);
931a936,1044
> 
> 
> char* TessBaseAPI::GetHTMLText(const float minConfidenceToShowColor) {
> 		if (page_res_ == NULL) {
> 			return NULL;
> 		}
> 	  int lcnt = 1, bcnt = 1, pcnt = 1, wcnt = 1;
> 
> 	  STRING html_str("");
> 	  bool isItalic = false;
> 	  bool isBold = false;
> 
> 
> 	  ResultIterator *res_it = GetIterator();
> 	  for (; !res_it->Empty(RIL_BLOCK); wcnt++) {
> 	    if (res_it->Empty(RIL_WORD)) {
> 	      res_it->Next(RIL_WORD);
> 	      continue;
> 	    }
> 
> 	    // Open any new block/paragraph/textline.
> 	    if (res_it->IsAtBeginningOf(RIL_BLOCK)) {
> 	    	html_str +="<div>";
> 	    }
> 	    if (res_it->IsAtBeginningOf(RIL_PARA)){
> 	    	html_str += "<p>";
> 	    }
> 
> 	    // Now, process the word...
> 	    const char *font_name;
> 	    bool bold, italic, underlined, monospace, serif, smallcaps;
> 	    int pointsize, font_id;
> 	    font_name = res_it->WordFontAttributes(&bold, &italic, &underlined,
> 	                                           &monospace, &serif, &smallcaps,
> 	                                           &pointsize, &font_id);
> 	    bool last_word_in_line = res_it->IsAtFinalElement(RIL_TEXTLINE, RIL_WORD);
> 	    bool last_word_in_para = res_it->IsAtFinalElement(RIL_PARA, RIL_WORD);
> 	    bool last_word_in_block = res_it->IsAtFinalElement(RIL_BLOCK, RIL_WORD);
> 
> 	    float confidence = res_it->Confidence(RIL_WORD);
> 		bool addConfidence = false;
> 		if (  confidence<minConfidenceToShowColor && res_it->GetUTF8Text(RIL_WORD)!=" "){
> 			addConfidence = true;
> 			html_str.add_str_int("<font conf='", (int)confidence);
> 			html_str += "' color='#DE2222'>";
> 		}
> 
> 		/*
> 		if (!isBold && bold) {
> 			html_str += "<em>";
> 			isBold = true;
> 		}
> 		*/
> 
> 	    if (!isItalic && italic) {
> 	    	html_str += "<strong>";
> 	    	isItalic =  true;
> 	    }
> 	    do {
> 	      const char *grapheme = res_it->GetUTF8Text(RIL_SYMBOL);
> 	      if (grapheme && grapheme[0] != 0) {
> 	        if (grapheme[1] == 0) {
> 	          switch (grapheme[0]) {
> 	            case '<': html_str += "&lt;"; break;
> 	            case '>': html_str += "&gt;"; break;
> 	            case '&': html_str += "&amp;"; break;
> 	            case '"': html_str += "&quot;"; break;
> 	            case '\'': html_str += "&#39;"; break;
> 	            default: html_str += grapheme; break;
> 	          }
> 	        } else {
> 	        	html_str += grapheme;
> 	        }
> 	      }
> 	      delete []grapheme;
> 	      res_it->Next(RIL_SYMBOL);
> 	    } while (!res_it->Empty(RIL_BLOCK) && !res_it->IsAtBeginningOf(RIL_WORD));
> 
> 	    if ((isItalic &&addConfidence==true) || (!italic && isItalic) || (isItalic && (last_word_in_block || last_word_in_para))){
> 	    	html_str += "</strong>";
> 	    	isItalic = false;
> 	    }
> 	    /*
> 	    if ((!bold && isBold) || (isBold && (last_word_in_block || last_word_in_para))){
> 	    	html_str += "</em>";
> 	    	isBold = false;
> 	    }
> 	    */
> 		if (addConfidence==true){
> 			html_str += "</font>";
> 		}
> 
> 	    html_str += " ";
> 
> 	    if (last_word_in_para) {
> 	    	html_str += "</p>\n";
> 	    	pcnt++;
> 	    }
> 	    if (last_word_in_block) {
> 	    	html_str += "</div>\n";
> 	    	bcnt++;
> 	    }
> 	  }
> 	  char *ret = new char[html_str.length() + 1];
> 	  strcpy(ret, html_str.string());
> 	  delete res_it;
> 	  return ret;
> }
> 
938,940c1051,1052
< char* TessBaseAPI::GetHOCRText(int page_number) {
<   if (tesseract_ == NULL ||
<       (page_res_ == NULL && Recognize(NULL) < 0))
---
> char* TessBaseAPI::GetHOCRText(struct ETEXT_DESC* monitor, int page_number) {
>   if (tesseract_ == NULL || (page_res_ == NULL && Recognize(monitor) < 0)) {
942c1054
< 
---
>   }
944a1057
>   float row_height, descenders, ascenders;
948c1061
<   if (input_file_ == NULL)
---
>   if (input_file_ == NULL) {
949a1063
>   }
953c1067
<   hocr_str += input_file_ ? *input_file_ : "unknown";
---
>   hocr_str += input_file_ ? *input_file_ : "android";
982a1097,1101
>       res_it->RowAttributes(&row_height,&descenders, &ascenders);
>       hocr_str.add_str_int("' font='", 15);
>       hocr_str.add_str_int("' size='", row_height);
>       hocr_str.add_str_int("' descenders='", descenders * -1);
>       hocr_str.add_str_int("' ascenders='", ascenders);
1010c1129
<             default: hocr_str += grapheme;
---
>             default: hocr_str += grapheme; break;
diff -r tesseract-ocr-r705/api/baseapi.h textfairy/tesseract-ocr-3.02/api/baseapi.h
494c494,498
<   char* GetHOCRText(int page_number);
---
>   char* GetHOCRText(struct ETEXT_DESC* monitor, int page_number);
> 
>   char* GetHTMLText(const float minConfidenceToShowColor);
> 
> 
diff -r tesseract-ocr-r705/ccmain/control.cpp textfairy/tesseract-ocr-3.02/ccmain/control.cpp
245c245,249
<         monitor->progress = 30 + 50 * word_index / stats_.word_count;
---
>         monitor->progress = 70 * word_index / stats_.word_count;
>         if (monitor->progress_callback!=NULL){
>         	TBOX box = page_res_it.word()->word->bounding_box();
>         	(*monitor->progress_callback)(monitor->progress,box.left(), box.right(), box.top(), box.bottom());
>         }
318c322,325
<       monitor->progress = 80 + 10 * word_index / stats_.word_count;
---
>       monitor->progress = 70 + 30 * word_index / stats_.word_count;
>       if (monitor->progress_callback!=NULL){
>           	  (*monitor->progress_callback)(monitor->progress,0,0,0,0);
>       }
diff -r tesseract-ocr-r705/ccmain/ltrresultiterator.cpp textfairy/tesseract-ocr-3.02/ccmain/ltrresultiterator.cpp
163a164,171
> void LTRResultIterator::RowAttributes(	float* row_height,
> 										float* descenders,
> 										float* ascenders) const{
> 	  *row_height = it_->row()->row->x_height() + it_->row()->row->ascenders() - it_->row()->row->descenders();
> 	  *descenders = it_->row()->row->descenders();
> 	  *ascenders = it_->row()->row->ascenders();
> }
> 
diff -r tesseract-ocr-r705/ccmain/ltrresultiterator.h textfairy/tesseract-ocr-3.02/ccmain/ltrresultiterator.h
112a113,114
>   void RowAttributes(float* row_height, float* descenders, float* ascenders) const;
> 
diff -r tesseract-ocr-r705/ccutil/ocrclass.h textfairy/tesseract-ocr-3.02/ccutil/ocrclass.h
110a111
> typedef bool (*PROGRESS_FUNC)(int progress, int left, int right, int top, int bottom );
119a121
>   PROGRESS_FUNC progress_callback;/*called whenever progress increases*/
Binary files tesseract-ocr-r705/tessdata/chi_sim.traineddata and textfairy/tesseract-ocr-3.02/tessdata/chi_sim.traineddata differ
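
For anyone skimming the diff above: the part that delivers the progress and bounding-box reporting is the new progress_callback field plus the two call sites in control.cpp. Below is a minimal, self-contained sketch of that pattern. It does not use Tesseract itself. MonitorSketch, BoxSketch, ReportWords and RecordProgress are hypothetical names made up for illustration (stand-ins for ETEXT_DESC, TBOX and the recognition loop); only the PROGRESS_FUNC signature and the 70 * word_index / word_count scaling are taken from the diff.

```cpp
#include <cassert>
#include <vector>

// Same signature as the PROGRESS_FUNC typedef the diff adds to ocrclass.h.
typedef bool (*PROGRESS_FUNC)(int progress, int left, int right, int top,
                              int bottom);

// Hypothetical stand-in for ETEXT_DESC: only the two fields the diff touches.
struct MonitorSketch {
  int progress;
  PROGRESS_FUNC progress_callback;
};

// Hypothetical stand-in for a word's bounding box (TBOX in Tesseract).
struct BoxSketch {
  int left, right, top, bottom;
};

// Mirrors the first control.cpp hunk: progress runs over 0..70 during this
// pass, and the callback (if one is set) receives the current word's box.
// As in the diff, the callback's bool return value is ignored.
void ReportWords(MonitorSketch* monitor, const std::vector<BoxSketch>& words) {
  assert(monitor != nullptr);
  const int word_count = static_cast<int>(words.size());
  for (int word_index = 0; word_index < word_count; ++word_index) {
    monitor->progress = 70 * word_index / word_count;  // integer division
    if (monitor->progress_callback != nullptr) {
      const BoxSketch& box = words[word_index];
      (*monitor->progress_callback)(monitor->progress, box.left, box.right,
                                    box.top, box.bottom);
    }
  }
}

// Example receiver: records each progress value it is handed. An app would
// draw the box here instead.
static std::vector<int> g_seen_progress;

bool RecordProgress(int progress, int /*left*/, int /*right*/, int /*top*/,
                    int /*bottom*/) {
  g_seen_progress.push_back(progress);
  return true;
}
```

A caller would fill in a MonitorSketch with RecordProgress as the callback and pass it to ReportWords. In the real patch the equivalent is setting progress_callback on the ETEXT_DESC that is handed to GetHOCRText(monitor, page_number).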
