Zdenko,

Thanks for making that fix. I am currently running tesseract from source on 
my computer. I've already made the fix on my source. However, if the fix 
were in an official release, then I could go back to using the officially 
released product.

I did find one other bug, which I have fixed locally in my tesseract code. 
Unless this bug is also fixed in the official version, I won't be able to 
drop my custom build. Here are the bug details:

1)      In file boxread.cpp, function ReadAllBoxes(), we convert 
GenericVector<char> to const char* without a trailing '\0'. This can cause 
a buffer read overrun inside the call to ReadMemBoxes(). To fix this, change 
function LoadDataFromFile() to always reserve an extra byte so the caller 
can append a '\0' if needed. Then, in ReadAllBoxes(), append '\0' to the 
vector after calling LoadDataFromFile(). Here are the fixed functions:


inline bool LoadDataFromFile(const STRING& filename,
                             GenericVector<char>* data) {
  bool result = false;
  FILE* fp = fopen(filename.string(), "rb");
  if (fp != NULL) {
    fseek(fp, 0, SEEK_END);
    size_t size = ftell(fp);
    fseek(fp, 0, SEEK_SET);
    if (size > 0) {
      // reserve an extra byte in case caller wants to append a '\0' character
      data->reserve(size + 1);
      data->resize_no_init(size);
      result = fread(&(*data)[0], 1, size, fp) == size;
    }
    fclose(fp);
  }
  return result;
}

bool ReadAllBoxes(int target_page, bool skip_blanks, const STRING& filename,
                  GenericVector<TBOX>* boxes,
                  GenericVector<STRING>* texts,
                  GenericVector<STRING>* box_texts,
                  GenericVector<int>* pages) {
  GenericVector<char> box_data;
  if (!tesseract::LoadDataFromFile(BoxFileName(filename), &box_data))
    return false;
  box_data.push_back('\0');
  return ReadMemBoxes(target_page, skip_blanks, &box_data[0], boxes, texts,
                      box_texts, pages);
}
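
For anyone who wants to try the idea outside the tesseract tree, here is a minimal standalone sketch of the same pattern. It uses std::vector<char> as a stand-in for GenericVector<char>, and the function name LoadFileNulTerminated is hypothetical (it is not part of tesseract). It also appends the terminator inside the loader rather than in the caller, which is a simplification of the fix above.

```cpp
#include <cstddef>
#include <cstdio>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical stand-in for LoadDataFromFile() + the push_back('\0') step,
// written against std::vector<char> instead of tesseract's GenericVector.
bool LoadFileNulTerminated(const std::string& filename,
                           std::vector<char>* data) {
  FILE* fp = std::fopen(filename.c_str(), "rb");
  if (fp == nullptr) return false;
  bool ok = false;
  if (std::fseek(fp, 0, SEEK_END) == 0) {
    const long size = std::ftell(fp);  // returns -1L on failure
    if (size > 0 && std::fseek(fp, 0, SEEK_SET) == 0) {
      const std::size_t n = static_cast<std::size_t>(size);
      data->reserve(n + 1);  // room for the terminator, no reallocation later
      data->resize(n);
      ok = std::fread(data->data(), 1, n, fp) == n;
      // Terminate so the buffer can be read safely as a C string.
      if (ok) data->push_back('\0');
    }
  }
  std::fclose(fp);
  return ok;
}
```

After a successful call, data->data() can be passed to code that expects a const char* C string, and data->size() is the file size plus one.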



On Saturday, June 2, 2018 at 2:22:16 AM UTC-6, zdenop wrote:
>
> Please check if this is ok now. If yes, I am willing to make 3.05.02 
> release ;-)
>
> Zdenko
>
>
> On Sat, 2 Jun 2018 at 10:16, Zdenko Podobny <[email protected]> wrote:
>
>> done in 
>> https://github.com/tesseract-ocr/tesseract/commit/bc5dfc4b953babcc865f68a55c3bf415f4280b1a
>> Zdenko
>>
>>
>> On Thu, 31 May 2018 at 22:39, shree <[email protected]> wrote:
>>
>>> This has been an issue for long. Thanks for finding the problem.
>>>
>>> Please submit a PR on github.
>>>
>>> On Friday, June 1, 2018 at 1:55:25 AM UTC+5:30, Paul Kitchen wrote:
>>>>
>>>> After a lot of stepping through tesseract code, I found the problem. 
>>>>
>>>> 1)      In file coutln.cpp, function C_OUTLINE::IsLegallyNested(), we 
>>>> assign outer_area() to an inT32, parent_area. Then lower in the function, 
>>>> we multiply child->outer_area() by parent_area. This caused an integer 
>>>> overflow, which gave the product the wrong sign. The fix was to make 
>>>> parent_area an inT64 so that integer overflow cannot happen.
>>>>
>>>>
>>>> The two 32-bit integers being multiplied were -51874 and 60218. The 
>>>> true result is -3123748532, but a signed 32-bit integer cannot hold a 
>>>> value below -2^31 (-2147483648), so the multiplication overflowed. The 
>>>> computed result was 1171218764, causing the if-statement to go down the 
>>>> wrong path.
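
A small standalone sketch that reproduces the arithmetic for -51874 and 60218. The helper names WrappedProduct32 and Product64 are hypothetical and do not appear in tesseract. Signed overflow is undefined behaviour in C++, so the 32-bit wraparound is modelled here with unsigned arithmetic, which wraps modulo 2^32.

```cpp
#include <cstdint>

// Models what a wrapping 32-bit multiply stores: the product modulo 2^32,
// reinterpreted as signed. The unsigned multiply avoids the undefined
// behaviour of signed overflow. (The final conversion is
// implementation-defined before C++20 when the value exceeds INT32_MAX;
// for the inputs discussed here the wrapped value fits in int32_t.)
int32_t WrappedProduct32(int32_t a, int32_t b) {
  const uint32_t wrapped = static_cast<uint32_t>(a) * static_cast<uint32_t>(b);
  return static_cast<int32_t>(wrapped);
}

// Widening one operand to 64 bits before multiplying keeps the full result,
// which is the effect of making parent_area a 64-bit integer.
int64_t Product64(int32_t a, int32_t b) {
  return static_cast<int64_t>(a) * b;
}
```

With these inputs, WrappedProduct32(-51874, 60218) gives 1171218764 (positive, so the sign is wrong), while Product64(-51874, 60218) gives -3123748532.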
>>>>
>>>> -- 
>>> You received this message because you are subscribed to the Google 
>>> Groups "tesseract-ocr" group.
>>> To unsubscribe from this group and stop receiving emails from it, send 
>>> an email to [email protected].
>>> To post to this group, send email to [email protected].
>>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>>> To view this discussion on the web visit 
>>> https://groups.google.com/d/msgid/tesseract-ocr/1ef0e822-9518-4cbb-af39-5a8ec6370d00%40googlegroups.com.
>>> For more options, visit https://groups.google.com/d/optout.
>>>
>>
