Here is a sample of the problem it causes. I run the following to train the attached image and box file:
    tesseract gdt.symbols.exp0.tif gdt.symbols.exp0 box.train

And here is the output:

    Tesseract Open Source OCR Engine v3.05.00dev with Leptonica
    Page 1
    Bad box coordinates in boxfile string! ²²²²▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌╦ÇÆ§≡¿←
    APPLY_BOXES:
       Boxes read from boxfile: 7
       Found 7 good blobs.
    Generated training data for 3 words

The message about the bad box coordinates appears because function
ReadMemBoxes() reads memory past the end of its const char* box_data
parameter. With the fix I suggested, this is the output:

    Tesseract Open Source OCR Engine v3.05.00dev with Leptonica
    Page 1
    APPLY_BOXES:
       Boxes read from boxfile: 7
       Found 7 good blobs.
    Generated training data for 3 words

On Monday, June 4, 2018 at 12:42:05 AM UTC-6, zdenop wrote:
>
> Paul,
>
> at the moment the focus is on the 4.0 release. But I understand that some
> users still need/prefer to use 3.05.
>
> Can you create some test/demonstration case for your last bugfix? Is it
> not fixed in 4.00...
>
> Zdenko
>
> Sun, 3 Jun 2018 at 4:03, Paul Kitchen <[email protected]> wrote:
>
>> Zdenko,
>>
>> Thanks for making that fix. I am currently running tesseract from source
>> on my computer. I've already made the fix on my source. However, if the
>> fix were in an official release, then I could go back to using the
>> officially released product.
>>
>> I did find one other bug that I fixed locally in my tesseract code.
>> Unless this other bug were also fixed in the official version, I
>> wouldn't be able to leave my custom code. Here are the bug details:
>>
>> 1) In file boxread.cpp, function ReadAllBoxes(), we convert
>> GenericVector<char> to const char* without a trailing '\0'. This can
>> cause a buffer read overrun inside the call to ReadMemBoxes(). To fix
>> this, change function LoadDataFromFile() to always reserve an extra byte
>> so the caller can add a '\0' if they want. Then in ReadAllBoxes(), append
>> '\0' to the vector after calling LoadDataFromFile().
>> Here are the fixed functions:
>>
>> inline bool LoadDataFromFile(const STRING& filename,
>>                              GenericVector<char>* data) {
>>   bool result = false;
>>   FILE* fp = fopen(filename.string(), "rb");
>>   if (fp != NULL) {
>>     fseek(fp, 0, SEEK_END);
>>     size_t size = ftell(fp);
>>     fseek(fp, 0, SEEK_SET);
>>     if (size > 0) {
>>       // reserve an extra byte in case caller wants to append a '\0' character
>>       data->reserve(size + 1);
>>       data->resize_no_init(size);
>>       result = fread(&(*data)[0], 1, size, fp) == size;
>>     }
>>     fclose(fp);
>>   }
>>   return result;
>> }
>>
>> bool ReadAllBoxes(int target_page, bool skip_blanks, const STRING& filename,
>>                   GenericVector<TBOX>* boxes,
>>                   GenericVector<STRING>* texts,
>>                   GenericVector<STRING>* box_texts,
>>                   GenericVector<int>* pages) {
>>   GenericVector<char> box_data;
>>   if (!tesseract::LoadDataFromFile(BoxFileName(filename), &box_data))
>>     return false;
>>   box_data.push_back('\0');
>>   return ReadMemBoxes(target_page, skip_blanks, &box_data[0], boxes, texts,
>>                       box_texts, pages);
>> }
>>
>> On Saturday, June 2, 2018 at 2:22:16 AM UTC-6, zdenop wrote:
>>>
>>> Please check if this is ok now. If yes, I am willing to make 3.05.02
>>> release ;-)
>>>
>>> Zdenko
>>>
>>> Sat, 2 Jun 2018 at 10:16, Zdenko Podobny <[email protected]> wrote:
>>>
>>>> done in
>>>> https://github.com/tesseract-ocr/tesseract/commit/bc5dfc4b953babcc865f68a55c3bf415f4280b1a
>>>> Zdenko
>>>>
>>>> Thu, 31 May 2018 at 22:39, shree <[email protected]> wrote:
>>>>
>>>>> This has been an issue for long. Thanks for finding the problem.
>>>>>
>>>>> Please submit a PR on github.
>>>>>
>>>>> On Friday, June 1, 2018 at 1:55:25 AM UTC+5:30, Paul Kitchen wrote:
>>>>>>
>>>>>> After a lot of stepping through tesseract code, I found the problem.
>>>>>>
>>>>>> 1) In file coutln.cpp, function C_OUTLINE::IsLegallyNested(), we
>>>>>> assign outer_area() to an inT32, parent_area. Then lower in the
>>>>>> function, we multiply child->outer_area() by parent_area.
>>>>>> This caused an integer overflow which resulted in a bad sign for
>>>>>> the multiplication. The fix was to make parent_area an inT64 so
>>>>>> that integer overflow cannot happen.
>>>>>>
>>>>>> The two 32-bit integers being multiplied were -51874 and 60218. The
>>>>>> true result should be -3123748532, but a signed 32-bit result must
>>>>>> lie between -2^31 and 2^31 - 1, so the product overflows, which is
>>>>>> the case here. The computed result was 1171218764, causing the
>>>>>> if-statement to go down the wrong path.
>>>>>>
>>>>> --
>>>>> You received this message because you are subscribed to the Google
>>>>> Groups "tesseract-ocr" group.
>>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>>> an email to [email protected].
>>>>> To post to this group, send email to [email protected].
>>>>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>>>>> To view this discussion on the web visit
>>>>> https://groups.google.com/d/msgid/tesseract-ocr/1ef0e822-9518-4cbb-af39-5a8ec6370d00%40googlegroups.com.
>>>>> For more options, visit https://groups.google.com/d/optout.
gdt.symbols.exp0.box
Description: Binary data

