Here is a sample of the problem it causes. I run the following command to 
train on the attached image and box file:

tesseract gdt.symbols.exp0.tif gdt.symbols.exp0 box.train

And here is the output:

Tesseract Open Source OCR Engine v3.05.00dev with Leptonica

Page 1
Bad box coordinates in boxfile string! ²²²²▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌╦ÇÆ§≡¿←
APPLY_BOXES:
   Boxes read from boxfile:       7
   Found 7 good blobs.
Generated training data for 3 words

The "Bad box coordinates" message appears because ReadMemBoxes() reads 
memory past the end of its const char* box_data parameter, which is not 
terminated with a '\0'.
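
To illustrate the overrun, here is a minimal sketch. It is not Tesseract code: 
std::vector<char> stands in for GenericVector<char>, and CountLines() and 
MakeTerminated() are hypothetical helpers. CountLines() scans a const char* 
until it finds '\0', as any C-string parser does, so the buffer it receives 
must contain that terminator.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for a parser that takes a const char* and scans
// until it reaches the terminating '\0'. If the buffer has no terminator,
// this loop reads past the end of the allocation.
int CountLines(const char* data) {
  assert(data != nullptr);
  int lines = 0;
  for (const char* p = data; *p != '\0'; ++p) {
    if (*p == '\n') ++lines;
  }
  return lines;
}

// Copies raw file bytes into a vector and appends the terminator, so the
// result can be passed safely to CountLines().
std::vector<char> MakeTerminated(const std::string& file_bytes) {
  std::vector<char> buf;
  buf.reserve(file_bytes.size() + 1);  // room for the extra '\0'
  buf.assign(file_bytes.begin(), file_bytes.end());
  buf.push_back('\0');
  return buf;
}
```

Calling CountLines(MakeTerminated(bytes).data()) is safe for any input, 
including an empty file, because the vector always holds at least the '\0'.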

With the fix I suggested, this is the output:

Tesseract Open Source OCR Engine v3.05.00dev with Leptonica
Page 1
APPLY_BOXES:
   Boxes read from boxfile:       7
   Found 7 good blobs.
Generated training data for 3 words


On Monday, June 4, 2018 at 12:42:05 AM UTC-6, zdenop wrote:
>
> Paul,
>
> at the moment the focus is on the 4.0 release. But I understand that some 
> users still need/prefer to use 3.05.
>
> Can you create a test/demonstration case for your last bugfix? Is it not 
> fixed in 4.00?
>
> Zdenko
>
>
> On Sun, 3 Jun 2018 at 4:03, Paul Kitchen <[email protected]> wrote:
>
>> Zdenko,
>>
>> Thanks for making that fix. I am currently running tesseract from source 
>> on my computer. I've already made the fix on my source. However, if the fix 
>> were in an official release, then I could go back to using the officially 
>> released product.
>>
>> I did find one other bug that I fixed locally in my tesseract code. 
>> Unless this other bug is also fixed in the official version, I won't be 
>> able to drop my custom code. Here are the bug details:
>>
>> 1)      In file boxread.cpp, function ReadAllBoxes() converts a 
>> GenericVector<char> to const char* without a trailing '\0'. This can cause 
>> a buffer read overrun inside the call to ReadMemBoxes(). To fix this, change 
>> function LoadDataFromFile() to always reserve an extra byte so the caller 
>> can append a '\0' if needed. Then in ReadAllBoxes(), append '\0' to the 
>> vector after calling LoadDataFromFile(). Here are the fixed functions:
>>
>>
>> inline bool LoadDataFromFile(const STRING& filename,
>>                              GenericVector<char>* data) {
>>   bool result = false;
>>   FILE* fp = fopen(filename.string(), "rb");
>>   if (fp != NULL) {
>>     fseek(fp, 0, SEEK_END);
>>     size_t size = ftell(fp);
>>     fseek(fp, 0, SEEK_SET);
>>     if (size > 0) {
>>       // Reserve an extra byte in case the caller wants to append a '\0'.
>>       data->reserve(size + 1);
>>       data->resize_no_init(size);
>>       result = fread(&(*data)[0], 1, size, fp) == size;
>>     }
>>     fclose(fp);
>>   }
>>   return result;
>> }
>>
>> bool ReadAllBoxes(int target_page, bool skip_blanks,
>>                   const STRING& filename,
>>                   GenericVector<TBOX>* boxes,
>>                   GenericVector<STRING>* texts,
>>                   GenericVector<STRING>* box_texts,
>>                   GenericVector<int>* pages) {
>>   GenericVector<char> box_data;
>>   if (!tesseract::LoadDataFromFile(BoxFileName(filename), &box_data))
>>     return false;
>>   box_data.push_back('\0');
>>   return ReadMemBoxes(target_page, skip_blanks, &box_data[0], boxes,
>>                       texts, box_texts, pages);
>> }
>>
>>
>>
>> On Saturday, June 2, 2018 at 2:22:16 AM UTC-6, zdenop wrote:
>>>
>>> Please check if this is ok now. If yes, I am willing to make 3.05.02 
>>> release ;-)
>>>
>>> Zdenko
>>>
>>>
>>> On Sat, 2 Jun 2018 at 10:16, Zdenko Podobny <[email protected]> wrote:
>>>
>>>> done in 
>>>> https://github.com/tesseract-ocr/tesseract/commit/bc5dfc4b953babcc865f68a55c3bf415f4280b1a
>>>> Zdenko
>>>>
>>>>
>>>> On Thu, 31 May 2018 at 22:39, shree <[email protected]> wrote:
>>>>
>>>>> This has been an issue for a long time. Thanks for finding the problem.
>>>>>
>>>>> Please submit a PR on github.
>>>>>
>>>>> On Friday, June 1, 2018 at 1:55:25 AM UTC+5:30, Paul Kitchen wrote:
>>>>>>
>>>>>> After a lot of stepping through tesseract code, I found the problem. 
>>>>>>
>>>>>> 1)      In file coutln.cpp, function C_OUTLINE::IsLegallyNested() 
>>>>>> assigns outer_area() to an inT32, parent_area. Lower in the 
>>>>>> function, child->outer_area() is multiplied by parent_area. This caused 
>>>>>> an integer overflow, which gave the product the wrong sign. The fix was 
>>>>>> to make parent_area an inT64 so that the multiplication cannot overflow.
>>>>>>
>>>>>>
>>>>>> The two 32-bit integers being multiplied were -51874 and 60218. The 
>>>>>> true result is -3123748532, but a signed 32-bit integer cannot hold a 
>>>>>> value below -2^31 (-2147483648), so the multiplication overflowed. The 
>>>>>> computed result was 1171218764, causing the if-statement to go down the 
>>>>>> wrong path.
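
The arithmetic above can be checked with a short C++ sketch. This is not the 
code from coutln.cpp; WrappedProduct32(), Product64() and 
CheckReportedValues() are hypothetical names used only for illustration. 
The 32-bit version multiplies as unsigned so that the wraparound is well 
defined, then reinterprets the bits as signed, which reproduces the value 
reported above. The 64-bit version widens one operand first, as the 
suggested fix does by making parent_area 64-bit.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Reproduces the wrapped 32-bit product. Unsigned multiplication wraps
// modulo 2^32 by definition, so this avoids the undefined behaviour of
// overflowing a signed int while yielding the same bit pattern.
int32_t WrappedProduct32(int32_t a, int32_t b) {
  uint32_t wrapped = static_cast<uint32_t>(a) * static_cast<uint32_t>(b);
  int32_t result;
  std::memcpy(&result, &wrapped, sizeof(result));  // reinterpret the bits
  return result;
}

// Widens one operand before multiplying, so the product of any two
// 32-bit values fits without overflow.
int64_t Product64(int32_t a, int32_t b) {
  return static_cast<int64_t>(a) * b;
}

// Checks the two values quoted in the message.
void CheckReportedValues() {
  assert(Product64(-51874, 60218) == -3123748532LL);
  assert(WrappedProduct32(-51874, 60218) == 1171218764);
}
```

The wrapped value is the true product plus 2^32 (-3123748532 + 4294967296 = 
1171218764), which is why the sign flips from negative to positive.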
>>>>>>
>>>>>> -- 
>>>>> You received this message because you are subscribed to the Google 
>>>>> Groups "tesseract-ocr" group.
>>>>> To unsubscribe from this group and stop receiving emails from it, send 
>>>>> an email to [email protected].
>>>>> To post to this group, send email to [email protected].
>>>>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>>>>> To view this discussion on the web visit 
>>>>> https://groups.google.com/d/msgid/tesseract-ocr/1ef0e822-9518-4cbb-af39-5a8ec6370d00%40googlegroups.com
>>>>>  
>>>>> <https://groups.google.com/d/msgid/tesseract-ocr/1ef0e822-9518-4cbb-af39-5a8ec6370d00%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>>> .
>>>>> For more options, visit https://groups.google.com/d/optout.
>>>>>


Attachment: gdt.symbols.exp0.box
Description: Binary data
