[tesseract-ocr] Re: preserving spaces

Aditya Singh Thu, 05 Sep 2019 04:50:23 -0700

The GetUTF8Text function in baseapi.cpp has since been changed, and now looks like this:
/** Make a text string from the internal data structures. */
char* TessBaseAPI::GetUTF8Text() {
  if (tesseract_ == NULL ||
      (!recognition_done_ && Recognize(NULL) < 0))
    return NULL;
  STRING text("");
  ResultIterator *it = GetIterator();
  do {
    if (it->Empty(RIL_PARA)) continue;
    const std::unique_ptr<const char[]> para_text(it->GetUTF8Text(RIL_PARA));
    text += para_text.get();
  } while (it->Next(RIL_PARA));
  char* result = new char[text.length() + 1];
  strncpy(result, text.string(), text.length() + 1);
  delete it;
  return result;
}


So there is no longer a *ptr++ = ' '; statement to replace. It would be great
if anyone could tell me how to preserve multiple spaces with the current code.
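
For reference, the idea behind the 2013 patch quoted below can be sketched outside of Tesseract: instead of writing one hard-coded blank between words, write as many blanks as were detected in front of each word. This is only a minimal, self-contained illustration. The Word struct, its blanks_before field, and JoinWithSpacing are hypothetical names and not Tesseract types; blanks_before stands in for what word->word->space() returned in the old code. Tesseract 4.x also has a preserve_interword_spaces config variable (set with -c on the command line or through SetVariable), which is worth trying before patching baseapi.cpp.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for one recognized word: its text plus the number of
// blanks detected in front of it. Not a Tesseract type.
struct Word {
  std::string text;
  int blanks_before;
};

// Joins the words of one line, emitting blanks_before spaces ahead of each
// word rather than a single fixed ' '. A non-zero count on the first word
// reproduces indentation.
std::string JoinWithSpacing(const std::vector<Word>& words) {
  std::string line;
  for (const Word& w : words) {
    if (w.blanks_before > 0) {
      line.append(static_cast<std::size_t>(w.blanks_before), ' ');
    }
    line += w.text;
  }
  return line;
}
```

With words {"Name", 0}, {"Qty", 4}, {"Price", 2}, this returns "Name", four blanks, "Qty", two blanks, "Price". How reliable the blank counts are depends entirely on the layout analysis, which matches the "a bit unpredictable" results reported below.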

On Friday, October 26, 2018 at 7:47:43 PM UTC+5:30, 
[email protected] wrote:
>
> Hi, thanks for your answer, but where can I find the baseapi.cpp file ?
>
> On Monday, June 10, 2013 at 2:37:28 PM UTC+2, Nick White wrote:
>>
>> Hi Eric, 
>>
>> Thanks for posting this. Out of curiosity, why do you need to 
>> preserve multiple spaces? 
>>
>> Do you think you could update the code to allow a new configuration 
>> variable? If you did, and posted the patch to the issues page, I 
>> expect it would be accepted, as this sounds like the sort of thing 
>> that is useful to be able to do. 
>>
>> Nick 
>>
>> On Sat, Jun 08, 2013 at 01:50:11PM -0700, [email protected] wrote: 
>> > I found the code Ray referred to back in '09. It is now in GetUTF8Text(). 
>> > In baseapi.cpp, in TessBaseAPI::GetUTF8Text, I changed: 
>> > 
>> >     *ptr++ = ' '; 
>> > 
>> > to 
>> > 
>> >     { 
>> >       int i ; 
>> >       for ( i = 0 ; i < word->word->space() ; i++ ) 
>> >         *ptr++ = ' '; 
>> >     } 
>> > 
>> > This added the multiple spaces back in, as advertised. The results are a 
>> > bit unpredictable (as Ray warned back in '09). 
>> > 
>> > I'll keep poking at it. 
>> > 
>> > Eric 
>> >       
>> > 
>> > On Saturday, June 8, 2013 10:37:20 AM UTC-4, [email protected] wrote: 
>> > 
>> >     I need to maintain the (multiple) spaces in my output document. About 5 
>> >     years ago someone asked how to do this and Ray posted a suggestion. That 
>> >     suggestion does not appear to correspond to the current source code. 
>> > 
>> >     Can anyone suggest how I can maintain word spacing both before the first 
>> >     word on a line (indentation) as well as between words within a line? 
>> > 
>> >     I can force the text in the input image to have fixed spacing. 
>> > 
>> >     Ideally, there is a command line switch or a config item that will do what 
>> >     I need, but I am not averse to modifying the code if necessary. 
>> > 
>> >     Thanks, 
>> >     Eric 
>> > 
>> > 
>>
>

-- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/84195bd3-d984-4155-8511-fb86c01914f3%40googlegroups.com.
