Zdenko,
Thanks for making that fix. I am currently running tesseract built from
source on my computer, and I had already applied the fix to my copy. Once
the fix is in an official release, I could go back to using the officially
released product.

I did find one other bug that I fixed locally in my tesseract code. Until
this other bug is also fixed in the official version, I won't be able to
drop my custom code. Here are the bug details:
1) In file boxread.cpp, function ReadAllBoxes(), we convert a
GenericVector<char> to const char* without a trailing '\0'. This can cause a
buffer read overrun inside the call to ReadMemBoxes(). To fix this, change
function LoadDataFromFile() to always reserve an extra byte so the caller
can append a '\0' if they want. Then in ReadAllBoxes(), append '\0' to the
vector after calling LoadDataFromFile(). Here are the fixed functions:

inline bool LoadDataFromFile(const STRING& filename,
                             GenericVector<char>* data) {
  bool result = false;
  FILE* fp = fopen(filename.string(), "rb");
  if (fp != NULL) {
    fseek(fp, 0, SEEK_END);
    size_t size = ftell(fp);
    fseek(fp, 0, SEEK_SET);
    if (size > 0) {
      // Reserve an extra byte in case the caller wants to append a '\0'.
      data->reserve(size + 1);
      data->resize_no_init(size);
      result = fread(&(*data)[0], 1, size, fp) == size;
    }
    fclose(fp);
  }
  return result;
}

bool ReadAllBoxes(int target_page, bool skip_blanks, const STRING& filename,
                  GenericVector<TBOX>* boxes,
                  GenericVector<STRING>* texts,
                  GenericVector<STRING>* box_texts,
                  GenericVector<int>* pages) {
  GenericVector<char> box_data;
  if (!tesseract::LoadDataFromFile(BoxFileName(filename), &box_data))
    return false;
  // Terminate the buffer so ReadMemBoxes() can treat it as a C string.
  box_data.push_back('\0');
  return ReadMemBoxes(target_page, skip_blanks, &box_data[0], boxes, texts,
                      box_texts, pages);
}
On Saturday, June 2, 2018 at 2:22:16 AM UTC-6, zdenop wrote:
>
> Please check if this is ok now. If yes, I am willing to make a 3.05.02
> release ;-)
>
> Zdenko
>
>
> On Sat, Jun 2, 2018 at 10:16, Zdenko Podobny <[email protected]> wrote:
>
>> done in
>> https://github.com/tesseract-ocr/tesseract/commit/bc5dfc4b953babcc865f68a55c3bf415f4280b1a
>> Zdenko
>>
>>
>> On Thu, May 31, 2018 at 22:39, shree <[email protected]> wrote:
>>
>>> This has been an issue for a long time. Thanks for finding the problem.
>>>
>>> Please submit a PR on github.
>>>
>>> On Friday, June 1, 2018 at 1:55:25 AM UTC+5:30, Paul Kitchen wrote:
>>>>
>>>> After a lot of stepping through tesseract code, I found the problem.
>>>>
>>>> 1) In file coutln.cpp, function C_OUTLINE::IsLegallyNested(), we
>>>> assign outer_area() to an inT32, parent_area. Then lower in the function,
>>>> we multiply child->outer_area() by parent_area. This caused an integer
>>>> overflow which resulted in a bad sign for the multiplication. The fix was
>>>> to make parent_area an inT64 so that integer overflow cannot happen.
>>>>
>>>>
>>>> The two 32-bit integers being multiplied were -51874 and 60218. The
>>>> true result should be -3123748532, but a signed 32-bit integer cannot
>>>> hold a value below -2^31 (-2147483648), so the multiplication overflows.
>>>> The computed result was 1171218764, which has the wrong sign and caused
>>>> the if-statement to go down the wrong path.
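
To make the arithmetic above concrete, here is a small standalone sketch (not
tesseract code; both function names are made up for illustration). Signed
overflow is undefined behavior in C++, so the first function models the usual
two's-complement wraparound with unsigned arithmetic instead of actually
overflowing an int. The second function shows the fix: widen one operand to 64
bits before multiplying.

```cpp
#include <cstdint>

// Hypothetical helper: the value a wrapping 32-bit signed multiply yields.
// The product is computed in 64-bit unsigned arithmetic, truncated to its
// low 32 bits, and those bits are then reinterpreted as two's complement.
int64_t WrappedProduct32(int32_t a, int32_t b) {
  const uint64_t wide = static_cast<uint64_t>(static_cast<uint32_t>(a)) *
                        static_cast<uint32_t>(b);
  const uint32_t low = static_cast<uint32_t>(wide & 0xFFFFFFFFu);
  return low < 0x80000000u ? static_cast<int64_t>(low)
                           : static_cast<int64_t>(low) - 4294967296LL;
}

// Hypothetical helper: the same multiply with one operand widened first,
// so the multiplication itself is carried out in 64 bits.
int64_t WideProduct64(int32_t a, int32_t b) {
  return static_cast<int64_t>(a) * b;
}
```

With the values from the bug, WrappedProduct32(-51874, 60218) gives 1171218764
(positive, wrong sign) and WideProduct64(-51874, 60218) gives -3123748532.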
>>>>
>>>> --
>>> You received this message because you are subscribed to the Google
>>> Groups "tesseract-ocr" group.
>>> To unsubscribe from this group and stop receiving emails from it, send
>>> an email to [email protected].
>>> To post to this group, send email to [email protected].
>>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>>> To view this discussion on the web visit
>>> https://groups.google.com/d/msgid/tesseract-ocr/1ef0e822-9518-4cbb-af39-5a8ec6370d00%40googlegroups.com.
>>> For more options, visit https://groups.google.com/d/optout.
>>>
>>
--
You received this message because you are subscribed to the Google Groups
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email
to [email protected].
To post to this group, send email to [email protected].
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit
https://groups.google.com/d/msgid/tesseract-ocr/a1b4da88-cb3f-4663-8ffd-d0c911e7b351%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.